Abstract

In comparative genomics, the first step of sequence analysis is usually to decompose two or more genomes
into syntenic blocks that are segments of homologous chromosomes. For the reliable recovery of syntenic
blocks, noise and ambiguities in the genomic maps need to be removed first. Maximal Strip Recovery (MSR)
is an optimization problem proposed by Zheng, Zhu, and Sankoff for reliably recovering syntenic blocks
from genomic maps in the midst of noise and ambiguities. Given d genomic maps as sequences of gene
markers, the objective of MSR-d is to find d subsequences, one subsequence of each genomic map, such that
the total length of syntenic blocks in these subsequences is maximized. For any constant d ≥ 2, a
polynomial-time 2d-approximation for MSR-d was previously known. In this paper, we show that for any
d ≥ 2, MSR-d is APX-hard, even for the most basic version of the problem in which all gene markers are
distinct and appear in positive orientation in each genomic map. Moreover, we provide the first explicit
lower bounds on approximating MSR-d for all d ≥ 2. In particular, we show that MSR-d is NP-hard to
approximate within Ω(d / log d). From the other direction, we show that the previous 2d-approximation for
MSR-d can be optimized into a polynomial-time algorithm even if d is not a constant but is part of the
input. We then extend our inapproximability results to several related problems including CMSR-d,
δ-gap-MSR-d, and δ-gap-CMSR-d.






Keywords: computational complexity, bioinformatics, sequence analysis, genome rearrangement. 

1 Introduction 



In comparative genomics, the first step of sequence analysis is usually to decompose two or more genomes
into syntenic blocks that are segments of homologous chromosomes. For the reliable recovery of syntenic
blocks, noise and ambiguities in the genomic maps need to be removed first. A genomic map is a sequence
of gene markers. A gene marker appears in a genomic map in either positive or negative orientation. Given
d genomic maps, Maximal Strip Recovery (MSR-d) is the problem of finding d subsequences, one subsequence
of each genomic map, such that the total length of strips of these subsequences is maximized [27, 11].
Here a strip is a maximal string of at least two markers such that either the string itself or its signed reversal
appears contiguously as a substring in each of the d subsequences in the solution. Without loss of generality,
we can assume that all markers appear in positive orientation in the first genomic map.

*This research was supported in part by NSF grant DBI-0743670. A preliminary version of this paper appeared in two parts
[17, 18] in the Proceedings of the 20th International Symposium on Algorithms and Computation (ISAAC 2009) and the
Proceedings of the 4th International Frontiers of Algorithmics Workshop (FAW 2010).

For example, the two genomic maps (the markers in negative orientation are underlined)

1 2 3 4 5 6 7 8 9 10 11 12
8 5 7 6 4 1 3 2 12 11 10 9

have two subsequences

1 3 6 7 8 10 11 12
8 7 6 1 3 12 11 10

of the maximum total strip length 8. The strip (1, 3) is positive and forward in both subsequences; the other 
two strips (6, 7, 8) and (10, 11, 12) are positive and forward in the first subsequence, but are negative and 
backward in the second subsequence. Intuitively, the strips are syntenic blocks, and the deleted markers not 
in the strips are noise and ambiguities in the genomic maps. 

The problem MSR-2 was introduced by Zheng, Zhu, and Sankoff [27], and was later generalized to
MSR-d for any d ≥ 2 by Chen, Fu, Jiang, and Zhu [11]. For MSR-2, Zheng et al. [27] presented a
potentially exponential-time heuristic that solves a subproblem of Maximum-Weight Clique. For MSR-d,
Chen et al. [11] presented a 2d-approximation based on Bar-Yehuda et al.'s fractional local-ratio algorithm
for Maximum-Weight Independent Set in d-interval graphs [6]; the running time of this 2d-approximation
algorithm is polynomial if d is a constant.

On the complexity side, Chen et al. [11] showed that several close variants of the problem MSR-d are
intractable. In particular, they showed that (i) MSR-2 is NP-complete if duplicate markers are allowed 
in each genomic map, and that (ii) MSR-3 is NP-complete even if the markers in each genomic map are 
distinct. The complexity of MSR-2 with no duplicates, however, was left as an open problem. 

In the biological context, a genomic map may contain duplicate markers as a paralogy set [27, p. 516],
but such maps are relatively rare. Thus MSR-2 without duplicates is the most useful version of MSR-d in
practice. Theoretically, MSR-2 without duplicates is the most basic and hence the most interesting version
of MSR-d. Also, the previous NP-hardness proofs of both (i) MSR-2 with duplicates and (ii) MSR-3 without
duplicates [11] rely on the fact that a marker may appear in a genomic map in either positive or negative
orientation. A natural question is whether there is any version of MSR-d that remains NP-hard even if all
markers in the genomic maps are in positive orientation.

We give a precise formulation of the most basic version of the problem MSR-d as follows: 

INSTANCE: Given d sequences G_i, 1 ≤ i ≤ d, where each sequence is a permutation of ⟨1, ..., n⟩.

QUESTION: Find a subsequence G'_i of each sequence G_i, 1 ≤ i ≤ d, and find a set of strips S_j, where each
strip is a sequence of length at least two over the alphabet {1, ..., n}, such that each subsequence G'_i
is the concatenation of the strips S_j in some order, and the total length of the strips S_j is maximized.
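
To make the formulation concrete, the following Python sketch solves tiny instances of the basic version by exhaustive search. It is an illustration only, not an algorithm from this paper: the names `is_feasible` and `msr_brute_force` are hypothetical, and the running time is exponential in n. The sketch relies on the observation that a set of kept markers is feasible exactly when every kept marker has a neighbor that is adjacent to it, in the same order, in all d restricted maps.

```python
from itertools import combinations

def is_feasible(maps, keep):
    """True if keeping exactly the markers in `keep` leaves only strips of length >= 2."""
    subs = [[m for m in g if m in keep] for g in maps]
    # Ordered pairs (a, b) such that b immediately follows a in every restricted map.
    common = set(zip(subs[0], subs[0][1:]))
    for s in subs[1:]:
        common &= set(zip(s, s[1:]))
    covered = {m for pair in common for m in pair}
    return covered == keep

def msr_brute_force(maps):
    """Return a largest feasible marker set (sorted list) by trying all subsets."""
    markers = sorted(maps[0])
    for size in range(len(markers), 1, -1):
        for keep in combinations(markers, size):
            if is_feasible(maps, set(keep)):
                return list(keep)
    return []

# Two maps whose strips (1, 2) and (3, 4) appear in different orders.
print(msr_brute_force([[1, 2, 3, 4], [3, 4, 1, 2]]))  # [1, 2, 3, 4]
```

The length of the returned list is the optimal total strip length; the strips themselves are the maximal runs of common adjacencies in any one of the restricted maps.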

The main result of this paper is the following theorem that settles the computational complexity of the
most basic version of Maximal Strip Recovery, and moreover provides the first explicit lower bounds on
approximating MSR-d for all d ≥ 2:

Theorem 1. MSR-d for any d ≥ 2 is APX-hard. Moreover, MSR-2, MSR-3, MSR-4, and MSR-d are
NP-hard to approximate within 1.000431, 1.002114, 1.010661, and Ω(d / log d), respectively, even if all
markers are distinct and appear in positive orientation in each genomic map.

Recall that for any constant d ≥ 2, MSR-d admits a polynomial-time 2d-approximation algorithm [11].
Thus MSR-d for any constant d ≥ 2 is APX-complete. Our following theorem gives a polynomial-time
2d-approximation algorithm for MSR-d even if the number d of genomic maps is not a constant but is part
of the input:

Theorem 2. For any d ≥ 2, there is a polynomial-time 2d-approximation algorithm for MSR-d if all
markers are distinct in each genomic map. This holds even if d is not a constant but is part of the input.

Compare the upper bound of 2d in Theorem 2 and the asymptotic lower bound of Ω(d / log d) in Theorem 1.

Maximal Strip Recovery [27, 11] is a maximization problem. Wang and Zhu [26] introduced Complement
Maximal Strip Recovery as a minimization problem. Given d genomic maps as input, the problem
CMSR-d is the same as the problem MSR-d except that the objective is minimizing the number of deleted
markers not in the strips, instead of maximizing the number of markers in the strips. A natural question is
whether a polynomial-time approximation scheme may be obtained for this problem. Our following theorem
shows that unless NP = P, CMSR-d cannot be approximated arbitrarily well:

Theorem 3. CMSR-d for any d ≥ 2 is APX-hard. Moreover, CMSR-2, CMSR-3, CMSR-4, and CMSR-d
for any d ≥ 173 are NP-hard to approximate within 1.000625, 1.0101215, 1.0202429, and 7/6 − O(log d / d),
respectively, even if all markers are distinct and appear in positive orientation in each genomic map. If the
number d of genomic maps is not a constant but is part of the input, then CMSR-d is NP-hard to approximate
within any constant less than 10√5 − 21 = 1.3606..., even if all markers are distinct and appear in positive
orientation in each genomic map.

Note the similarity between Theorem 1 and Theorem 3. In fact, our proof of Theorem 3 uses exactly the
same constructions as our proof of Theorem 1. The only difference is in the analysis of the approximation
lower bounds.

Bulteau, Fertin, and Rusu [10] recently proposed a restricted variant of Maximal Strip Recovery called
δ-gap-MSR, which is MSR-2 with the additional constraint that at most δ markers may be deleted between
any two adjacent markers of a strip in each genomic map. We now define δ-gap-MSR-d and δ-gap-CMSR-d
as the restricted variants of the two problems MSR-d and CMSR-d, respectively, with the additional
δ-gap constraint. Bulteau et al. [10] proved that δ-gap-MSR-2 is APX-hard for any δ ≥ 2, and is NP-hard for
δ = 1. We extend our proofs of Theorem 1 and Theorem 3 to obtain the following theorem on δ-gap-MSR-d
and δ-gap-CMSR-d for any δ ≥ 2:

Theorem 4. Let δ ≥ 2. Then

(1) δ-gap-MSR-d for any d ≥ 2 is APX-hard. Moreover, δ-gap-MSR-2, δ-gap-MSR-3, δ-gap-MSR-4, and
δ-gap-MSR-d are NP-hard to approximate within 1.000431, 1.002114, 1.010661, and d/2^{O(√log d)},
respectively, even if all markers are distinct and appear in positive orientation in each genomic map.

(2) δ-gap-CMSR-d for any d ≥ 2 is APX-hard. Moreover, δ-gap-CMSR-2, δ-gap-CMSR-3, δ-gap-CMSR-4,
and δ-gap-CMSR-d for any d ≥ 173 are NP-hard to approximate within 1.000625, 1.0101215,
1.0202429, and 7/6 − O(log d / d), respectively, even if all markers are distinct and appear in
positive orientation in each genomic map. If the number d of genomic maps is not a constant but
is part of the input, then δ-gap-CMSR-d is NP-hard to approximate within any constant less than
10√5 − 21 = 1.3606..., even if all markers are distinct and appear in positive orientation in each
genomic map.

We refer to [13, 20, 19] for some related results. Maximal Strip Recovery is a typical combinatorial
problem in biological sequence analysis, in particular, genome rearrangement. The earliest inapproximability
result for genome rearrangement problems is due to Berman and Karpinski [7], who proved that Sorting by
Reversals is NP-hard to approximate within any constant less than 1237/1236. More recently, Zhu and Wang [28]
proved that Translocation Distance is NP-hard to approximate within any constant less than 5717/5716.
Similar inapproximability results have also been obtained for other important problems in bioinformatics. For
example, Nagashima and Yamazaki [23] proved that Non-overlapping Local Alignment is NP-hard to
approximate within any constant less than 8668/8665, and Manthey [22] proved that Multiple Sequence Alignment
with weighted sum-of-pairs score is APX-hard for arbitrary metric scoring functions over the binary
alphabet.
The rest of this paper is organized as follows. We first review some preliminaries in Section 2. Then, in
Sections 3, 4, 5, and 6, we show that MSR-d for any d ≥ 2 is APX-hard, and prove explicit approximation
lower bounds. (For any two constants d and d' such that d' > d ≥ 2, the problem MSR-d is a special
case of the problem MSR-d' with d' − d redundant genomic maps. Thus the APX-hardness of MSR-2
implies the APX-hardness of MSR-d for all constants d ≥ 2. To present the ideas progressively, however,
we show that MSR-4, MSR-3, and MSR-2 are APX-hard by three different L-reductions of increasing
sophistication.) In Section 7, we present a 2d-approximation algorithm for MSR-d that runs in polynomial
time even if the number d of genomic maps is not a constant but is part of the input. In Section 8, we present
inapproximability results for CMSR-d, δ-gap-MSR-d, and δ-gap-CMSR-d. We conclude with remarks in
Section 9.



2 Preliminaries 

L-reduction. Given two optimization problems X and Y, an L-reduction [24] from X to Y consists of
two polynomial-time functions f and g and two positive constants α and β satisfying the following two
properties:

1. For every instance x of X, f(x) is an instance of Y such that

\[ \mathrm{opt}(f(x)) \le \alpha \cdot \mathrm{opt}(x). \tag{1} \]

2. For every feasible solution y to f(x), g(y) is a feasible solution to x such that

\[ |\mathrm{opt}(x) - \mathrm{val}(g(y))| \le \beta \cdot |\mathrm{opt}(f(x)) - \mathrm{val}(y)|. \tag{2} \]

Here opt(x) denotes the value of the optimal solution to an instance x, and val(y) denotes the value
of a solution y. The two properties of L-reduction imply the following inequality on the relative errors of
approximation:

\[ \frac{|\mathrm{opt}(x) - \mathrm{val}(g(y))|}{\mathrm{opt}(x)} \le \alpha\beta \cdot \frac{|\mathrm{opt}(f(x)) - \mathrm{val}(y)|}{\mathrm{opt}(f(x))}. \]

A relative error of ε corresponds to an approximation factor of 1 + ε for a minimization problem, and
corresponds to an approximation factor of 1/(1 − ε) for a maximization problem. Thus we have the following
propositions:

1. For a minimization problem X and a minimization problem Y, if X is NP-hard to approximate within
1 + αβε, then Y is NP-hard to approximate within 1 + ε.

2. For a maximization problem X and a maximization problem Y, if X is NP-hard to approximate within
1/(1 − αβε), then Y is NP-hard to approximate within 1/(1 − ε).

3. For a minimization problem X and a maximization problem Y, if X is NP-hard to approximate within
1 + αβε, then Y is NP-hard to approximate within 1/(1 − ε).

4. For a maximization problem X and a minimization problem Y, if X is NP-hard to approximate within
1/(1 − αβε), then Y is NP-hard to approximate within 1 + ε.
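
The four propositions amount to a single computation on relative errors, which can be sketched in Python as follows. The function name `transfer_bound` and its parameters are hypothetical and serve only as an illustration; the numerical inputs are the Max-IS-3 bound 1.010661 and the products αβ = 1 and αβ = 5 that arise in the L-reductions of Sections 3 and 4.

```python
def transfer_bound(bound_x, alpha_beta, x_is_min, y_is_min):
    """Turn an inapproximability factor for X into one for Y via an L-reduction.

    bound_x    -- factor within which X is NP-hard to approximate (> 1)
    alpha_beta -- product of the two L-reduction constants
    """
    # Relative error that corresponds to the factor for X.
    err_x = bound_x - 1 if x_is_min else 1 - 1 / bound_x
    # A relative error of eps for Y yields at most alpha*beta*eps for X.
    err_y = err_x / alpha_beta
    # Convert the relative error for Y back to an approximation factor.
    return 1 + err_y if y_is_min else 1 / (1 - err_y)

# Maximization to maximization, as in the reductions from Max-IS-3.
print(round(transfer_bound(1.010661, 1, False, False), 6))  # 1.010661
print(round(transfer_bound(1.010661, 5, False, False), 6))  # 1.002114
```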
APX-hard optimization problems. We review the complexities of some APX-hard optimization problems
that will be used in our reductions.

• Max-IS-Δ is the problem Maximum Independent Set in graphs of maximum degree Δ. Max-IS-3
is APX-hard; see [4]. Moreover, Chlebík and Chlebíková [12] showed that Max-IS-3 and Max-IS-4
are NP-hard to approximate within 1.010661 and 1.0215517, respectively. Trevisan [25] showed that
Max-IS-Δ is NP-hard to approximate within Δ/2^{O(√log Δ)}.

• Min-VC-Δ is the problem Minimum Vertex Cover in graphs of maximum degree Δ. Min-VC-3 is
APX-hard; see [4]. Moreover, Chlebík and Chlebíková [12] showed that Min-VC-3 and Min-VC-4
are NP-hard to approximate within 1.0101215 and 1.0202429, respectively, and, for any Δ ≥ 228,
Min-VC-Δ is NP-hard to approximate within 7/6 − O(log Δ/Δ). Dinur and Safra [14] showed
that Minimum Vertex Cover is NP-hard to approximate within any constant less than 10√5 − 21 =
1.3606....

• Given a set X of n variables and a set C of m clauses, where each variable has exactly p literals (in p
different clauses) and each clause is the disjunction of exactly q literals (of q different variables),
Ep-Occ-Max-Eq-SAT is the problem of finding an assignment of X that satisfies the maximum number
of clauses in C. Note that np = mq. Berman and Karpinski [8] showed that E3-Occ-Max-E2-SAT is
NP-hard to approximate within any constant less than 464/463.

• Given d disjoint sets V_i of vertices, 1 ≤ i ≤ d, and given a set E ⊆ V_1 × ··· × V_d of hyper-edges,
d-Dimensional-Matching is the problem of finding a maximum-cardinality subset M ⊆ E of
pairwise-disjoint hyper-edges. Hazan, Safra, and Schwartz [16] showed that d-Dimensional-Matching
is NP-hard to approximate within Ω(d / log d).

Linear forest and linear arboricity. A linear forest is a graph in which every connected component is
a path. The linear arboricity of a graph is the minimum number of linear forests into which the edges of
the graph can be decomposed. Akiyama, Exoo, and Harary [2] conjectured that the linear arboricity of
every graph G of maximum degree Δ satisfies la(G) ≤ ⌈(Δ + 1)/2⌉. This conjecture has been confirmed
for graphs of small constant degrees, and has been shown to be asymptotically correct as Δ → ∞ [5]. In
particular, the proofs of the conjecture for Δ = 3 and 4 are constructive [2, 1, 3] and lead to polynomial-time
algorithms for decomposing any graph of maximum degree Δ = 3 and 4 into at most ⌈(Δ + 1)/2⌉ = 2
and 3 linear forests, respectively. Also, the proof of the first upper bound on linear arboricity by Akiyama,
Exoo, and Harary [3] implies a simple polynomial-time algorithm for decomposing any graph of maximum
degree Δ into at most ⌈3⌈Δ/2⌉/2⌉ linear forests.
Define

\[ f(\Delta) = \max_{G} f(G), \]

where G ranges over all graphs of maximum degree Δ, and f(G) denotes the number of linear forests that
Akiyama, Exoo, and Harary's algorithm [3] decomposes G into. Then

\[ \lceil (\Delta + 1)/2 \rceil \le f(\Delta) \le \lceil 3 \lceil \Delta/2 \rceil / 2 \rceil. \tag{3} \]
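
As a small illustration of these definitions, the following Python sketch tests whether a set of edges forms a linear forest and evaluates the two bounds in (3). It is not one of the decomposition algorithms cited above; the names `is_linear_forest` and `linear_arboricity_bounds` are hypothetical. The sample edge sets are the two linear forests of the example graph used in Section 3.

```python
def is_linear_forest(edges):
    """True if the graph formed by `edges` is a disjoint union of paths."""
    degree, parent = {}, {}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        for w in (u, v):
            parent.setdefault(w, w)
            degree[w] = degree.get(w, 0) + 1
            if degree[w] > 2:
                return False  # a path has no vertex of degree 3 or more
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # the edge would close a cycle
        parent[root_u] = root_v
    return True

def linear_arboricity_bounds(delta):
    """Lower and upper bound of inequality (3) for maximum degree delta."""
    lower = (delta + 2) // 2                   # ceil((delta + 1) / 2)
    upper = (3 * ((delta + 1) // 2) + 1) // 2  # ceil(3 * ceil(delta / 2) / 2)
    return lower, upper

E1 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
E2 = [(1, 7), (7, 8), (8, 3), (4, 9), (9, 6)]
print(is_linear_forest(E1), is_linear_forest(E2), is_linear_forest(E1 + E2))
print(linear_arboricity_bounds(3), linear_arboricity_bounds(4))
```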

3 MSR-4 is APX-hard 

In this section, we prove that MSR-4 is APX-hard by a simple L-reduction from Max-IS-3. Before we 
present the L-reduction, we first show that MSR-4 is NP-hard by a reduction in the classical style, which 
is perhaps more familiar to most readers. Throughout this paper, we follow this progressive format of 
presentation. 
3.1 NP-hardness reduction from Max-IS-3 to MSR-4

Let G be a graph of maximum degree 3. Let n be the number of vertices in G. Partition the edges of G
into two linear forests E_1 and E_2. Let V_1 and V_2 be the vertices of G that are not incident to any edges in
E_1 and in E_2, respectively. We construct four genomic maps G→, G←, G_1, and G_2, where each map is a
permutation of the following 2n distinct markers all in positive orientation:

• n pairs of vertex markers ⊂_i and ⊃_i, 1 ≤ i ≤ n.

G→ and G← are concatenations of the n pairs of vertex markers with ascending and descending indices,
respectively:

G→: ⊂_1 ⊃_1 ··· ⊂_n ⊃_n
G←: ⊂_n ⊃_n ··· ⊂_1 ⊃_1

G_1 and G_2 are represented schematically as follows:

G_1: ⟨E_1⟩ ⟨V_1⟩
G_2: ⟨E_2⟩ ⟨V_2⟩

⟨E_1⟩ and ⟨E_2⟩ consist of vertex markers of the vertices incident to the edges in E_1 and E_2, respectively.
The markers of the vertices in each path v_1 v_2 ··· v_k are grouped together in an interleaving pattern:
for 1 ≤ i ≤ k, the left marker of v_i, the right marker of v_{i−1} (if i > 1), the left marker of v_{i+1} (if
i < k), and the right marker of v_i are consecutive.

⟨V_1⟩ and ⟨V_2⟩ consist of vertex markers of the vertices in V_1 and V_2, respectively. The left marker and the
right marker of each pair are consecutive.

This completes the construction. We refer to Figure 1 (a) and (b) for an example.

Two pairs of markers intersect in a genomic map if a marker of one pair appears between the two markers 
of the other pair. The following property of our construction is obvious: 

Proposition 1. Two vertices are adjacent in the graph G if and only if the corresponding two pairs of vertex
markers intersect in one of the two genomic maps G_1, G_2.
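
The construction of G_1 and G_2 and the claim of Proposition 1 can be checked mechanically on the example graph of Figure 1(a). The Python sketch below is an illustration under an assumed encoding: a marker is a tuple (vertex, "L") or (vertex, "R") for the left and right marker of a pair, and the names `interleave`, `build_map`, and `intersect` are hypothetical.

```python
from itertools import combinations

def interleave(path):
    """Markers of one path v_1 ... v_k in the interleaving pattern."""
    seq = [(path[0], "L")]
    for prev, cur in zip(path, path[1:]):
        seq += [(cur, "L"), (prev, "R")]
    seq.append((path[-1], "R"))
    return seq

def build_map(paths, isolated):
    """Interleaved paths of one linear forest, then the pairs of its isolated vertices."""
    seq = [m for p in paths for m in interleave(p)]
    for v in isolated:
        seq += [(v, "L"), (v, "R")]
    return seq

def intersect(seq, u, v):
    """True if a marker of one pair appears between the two markers of the other pair."""
    pos = {m: i for i, m in enumerate(seq)}
    a, b = pos[(u, "L")], pos[(u, "R")]
    c, d = pos[(v, "L")], pos[(v, "R")]
    return a < c < b or a < d < b or c < a < d or c < b < d

# The graph of Figure 1(a): two linear forests and their isolated vertices.
E1_PATHS, V1 = [[1, 2, 3, 4, 5, 6]], [7, 8, 9]
E2_PATHS, V2 = [[1, 7, 8, 3], [4, 9, 6]], [2, 5]
G1, G2 = build_map(E1_PATHS, V1), build_map(E2_PATHS, V2)
EDGES = {frozenset(e) for p in E1_PATHS + E2_PATHS for e in zip(p, p[1:])}

def proposition_1_holds():
    """Adjacency in the graph coincides with intersection in G_1 or G_2."""
    return all(
        (frozenset((u, v)) in EDGES) == (intersect(G1, u, v) or intersect(G2, u, v))
        for u, v in combinations(range(1, 10), 2)
    )

print([v for v, _ in G1])
print(proposition_1_holds())
```

The printed index sequence of G_1 can be compared with the third row of Figure 1(b).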

We say that four subsequences of the four genomic maps G→, G←, G_1, G_2 are canonical if each strip
of the subsequences is a pair of vertex markers. We have the following lemma on canonical subsequences:

Lemma 1. In any four subsequences of the four genomic maps G→, G←, G_1, G_2, respectively, each strip
must be a pair of vertex markers.

Proof. By construction, a strip cannot include two vertex markers of different indices because they appear
in different orders in G→ and in G←. □

The following lemma establishes the NP-hardness of MSR-4:

Lemma 2. The graph G has an independent set of at least k vertices if and only if the four genomic maps
G→, G←, G_1, G_2 have four subsequences whose total strip length l is at least 2k.

Proof. We first prove the "only if" direction. Suppose that the graph G has an independent set of at least k
vertices. We will show that the four genomic maps G→, G←, G_1, G_2 have four subsequences of total strip
length at least 2k. By Proposition 1, the k vertices in the independent set correspond to k pairs of vertex
markers that do not intersect each other in the genomic maps. These k pairs of vertex markers induce a
subsequence of length 2k in each genomic map. In each subsequence, the left marker and the right marker
of each pair appear consecutively and compose a strip. Thus the total strip length is at least 2k. We refer to
Figure 1(c) for an example.

We next prove the "if" direction. Suppose that the four genomic maps G→, G←, G_1, G_2 have four
subsequences of total strip length at least 2k. We will show that the graph G has an independent set of at
least k vertices. By Lemma 1, each strip of the subsequences must be a pair of vertex markers. Thus we
obtain at least k pairs of vertex markers that do not intersect each other in the genomic maps. Then, by
Proposition 1, the corresponding set of at least k vertices in the graph G form an independent set. □

(b)
G→:  ⊂_1 ⊃_1 ⊂_2 ⊃_2 ⊂_3 ⊃_3 ⊂_4 ⊃_4 ⊂_5 ⊃_5 ⊂_6 ⊃_6 ⊂_7 ⊃_7 ⊂_8 ⊃_8 ⊂_9 ⊃_9
G←:  ⊂_9 ⊃_9 ⊂_8 ⊃_8 ⊂_7 ⊃_7 ⊂_6 ⊃_6 ⊂_5 ⊃_5 ⊂_4 ⊃_4 ⊂_3 ⊃_3 ⊂_2 ⊃_2 ⊂_1 ⊃_1
G_1: ⊂_1 ⊂_2 ⊃_1 ⊂_3 ⊃_2 ⊂_4 ⊃_3 ⊂_5 ⊃_4 ⊂_6 ⊃_5 ⊃_6   ⊂_7 ⊃_7 ⊂_8 ⊃_8 ⊂_9 ⊃_9
G_2: ⊂_1 ⊂_7 ⊃_1 ⊂_8 ⊃_7 ⊂_3 ⊃_8 ⊃_3   ⊂_4 ⊂_9 ⊃_4 ⊂_6 ⊃_9 ⊃_6   ⊂_2 ⊃_2 ⊂_5 ⊃_5

(c)
G→:  ⊂_2 ⊃_2 ⊂_4 ⊃_4 ⊂_6 ⊃_6 ⊂_8 ⊃_8
G←:  ⊂_8 ⊃_8 ⊂_6 ⊃_6 ⊂_4 ⊃_4 ⊂_2 ⊃_2
G_1: ⊂_2 ⊃_2 ⊂_4 ⊃_4 ⊂_6 ⊃_6 ⊂_8 ⊃_8
G_2: ⊂_8 ⊃_8 ⊂_4 ⊃_4 ⊂_6 ⊃_6 ⊂_2 ⊃_2

Figure 1: (a) The graph G: E_1 is a single solid path (1, 2, 3, 4, 5, 6), E_2 consists of two dotted paths (1, 7, 8, 3) and
(4, 9, 6), V_1 = {7, 8, 9}, V_2 = {2, 5}. (b) The four genomic maps G→, G←, G_1, G_2. (c) The four subsequences of the
genomic maps corresponding to the independent set {2, 4, 6, 8} in the graph.
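
The "only if" direction can be replayed on the example of Figure 1. The Python sketch below is an illustration with hypothetical names (`FIGURE_1B`, `restrict`, `strips`, `total_strip_length`): it encodes the four maps of Figure 1(b) as an index string and a side string (C for the left marker, D for the right marker, an assumed encoding), restricts them to a vertex set, and extracts the strips as maximal runs of adjacencies shared by all four subsequences.

```python
# Figure 1(b): marker indices and sides (C = left marker, D = right marker).
FIGURE_1B = [
    ("112233445566778899", "CDCDCDCDCDCDCDCDCD"),  # G-forward
    ("998877665544332211", "CDCDCDCDCDCDCDCDCD"),  # G-backward
    ("121324354656778899", "CCDCDCDCDCDDCDCDCD"),  # G_1
    ("171873834946962255", "CCDCDCDDCCDCDDCDCD"),  # G_2
]
MAPS = [list(zip(idx, side)) for idx, side in FIGURE_1B]

def restrict(maps, vertices):
    """Keep only the markers of the given vertices in every map."""
    keep = {str(v) for v in vertices}
    return [[m for m in seq if m[0] in keep] for seq in maps]

def strips(subsequences):
    """Maximal blocks of length >= 2 that occur contiguously in every subsequence."""
    first = subsequences[0]
    common = set(zip(first, first[1:]))
    for s in subsequences[1:]:
        common &= set(zip(s, s[1:]))
    blocks, current = [], first[:1]
    for a, b in zip(first, first[1:]):
        if (a, b) in common:
            current.append(b)
        else:
            blocks.append(current)
            current = [b]
    blocks.append(current)
    return [blk for blk in blocks if len(blk) >= 2]

def total_strip_length(maps, vertices):
    return sum(len(s) for s in strips(restrict(maps, vertices)))

print(total_strip_length(MAPS, {2, 4, 6, 8}))  # 8: four strips, one per vertex
print(total_strip_length(MAPS, {1, 2}))        # 0: vertices 1 and 2 are adjacent
```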

3.2 L-reduction from Max-IS-3 to MSR-4 

We present an L-reduction (f, g, α, β) from Max-IS-3 to MSR-4 as follows. The function f, given a graph G
of maximum degree 3, constructs the four genomic maps G→, G←, G_1, G_2 as in the NP-hardness reduction.
Let k* be the number of vertices in a maximum independent set in G, and let l* be the maximum total strip
length of any four subsequences of G→, G←, G_1, G_2, respectively. By Lemma 2, we have

\[ l^* = 2k^*. \]

Choose α = 2, then property (1) of L-reduction is satisfied.

The function g, given four subsequences of the four genomic maps G→, G←, G_1, G_2, respectively,
returns an independent set of vertices in the graph G corresponding to the pairs of vertex markers that are
strips of the subsequences. Let l be the total strip length of the subsequences, and let k be the number of
vertices in the independent set returned by the function g. Then k ≥ l/2. It follows that

\[ |k^* - k| = k^* - k \le l^*/2 - l/2 = |l^* - l|/2. \]

Choose β = 1/2, then property (2) of L-reduction is also satisfied.

We have obtained an L-reduction from Max-IS-3 to MSR-4 with αβ = 1. Chlebík and Chlebíková [12]
showed that Max-IS-3 is NP-hard to approximate within 1.010661. It follows that MSR-4 is also NP-hard
to approximate within 1.010661. The lower bound extends to MSR-d for all constants d ≥ 4.

The L-reduction from Max-IS-3 to MSR-4 generalizes directly:

Lemma 3. Let Δ ≥ 3 and d > Δ. If there is a polynomial-time algorithm for decomposing any graph of
maximum degree Δ into d − 2 linear forests, then there is an L-reduction from Max-IS-Δ to MSR-d with
constants α = 2 and β = 1/2.

4 MSR-3 is APX-hard 

In this section, we prove that MSR-3 is APX-hard by a slightly more sophisticated L-reduction again from 
Max-IS-3. 

4.1 NP-hardness reduction from Max-IS-3 to MSR-3 

Let G be a graph of maximum degree 3. Let n be the number of vertices in G. Partition the edges of G into
two linear forests E_1 and E_2. Let V_1 and V_2 be the vertices of G that are not incident to any edges in E_1
and E_2, respectively. We construct three genomic maps G_0, G_1, and G_2, where each map is a permutation
of the following 4n distinct markers all in positive orientation:

• n pairs of vertex markers ⊂_i and ⊃_i, 1 ≤ i ≤ n;

• n pairs of dummy markers ⊏_i and ⊐_i, 1 ≤ i ≤ n.

G_0 consists of the 2n pairs of vertex and dummy markers in an alternating pattern:

G_0: ⊂_1 ⊃_1 ⊏_1 ⊐_1 ··· ⊂_n ⊃_n ⊏_n ⊐_n

G_1 and G_2 are represented schematically as follows:

G_1: ⟨V_1⟩ ⟨E_1⟩ ⟨D⟩
G_2: ⟨D⟩ ⟨E_2⟩ ⟨V_2⟩

⟨E_1⟩ and ⟨E_2⟩ consist of vertex markers of the vertices incident to the edges in E_1 and E_2, respectively.
The markers of the vertices in each path v_1 v_2 ··· v_k are grouped together in an interleaving pattern:
for 1 ≤ i ≤ k, the left marker of v_i, the right marker of v_{i−1} (if i > 1), the left marker of v_{i+1} (if
i < k), and the right marker of v_i are consecutive.

⟨V_1⟩ and ⟨V_2⟩ consist of vertex markers of the vertices in V_1 and V_2, respectively. The left marker and the
right marker of each pair are consecutive.

⟨D⟩ is the reverse permutation of the n pairs of dummy markers:

⟨D⟩: ⊏_n ⊐_n ··· ⊏_1 ⊐_1

This completes the construction. We refer to Figure 2 (a) and (b) for an example.

(b)
G_0: ⊂_1 ⊃_1 ⊏_1 ⊐_1 ⊂_2 ⊃_2 ⊏_2 ⊐_2 ··· ⊂_9 ⊃_9 ⊏_9 ⊐_9
G_1: ⊂_7 ⊃_7 ⊂_8 ⊃_8 ⊂_9 ⊃_9   ⊂_1 ⊂_2 ⊃_1 ⊂_3 ⊃_2 ⊂_4 ⊃_3 ⊂_5 ⊃_4 ⊂_6 ⊃_5 ⊃_6   ⊏_9 ⊐_9 ⊏_8 ⊐_8 ··· ⊏_1 ⊐_1
G_2: ⊏_9 ⊐_9 ⊏_8 ⊐_8 ··· ⊏_1 ⊐_1   ⊂_1 ⊂_7 ⊃_1 ⊂_8 ⊃_7 ⊂_3 ⊃_8 ⊃_3   ⊂_4 ⊂_9 ⊃_4 ⊂_6 ⊃_9 ⊃_6   ⊂_2 ⊃_2 ⊂_5 ⊃_5

(c)
G_0: ⊏_1 ⊐_1 ⊂_2 ⊃_2 ⊏_2 ⊐_2 ⊏_3 ⊐_3 ⊂_4 ⊃_4 ⊏_4 ⊐_4 ⊏_5 ⊐_5 ⊂_6 ⊃_6 ⊏_6 ⊐_6 ⊏_7 ⊐_7 ⊂_8 ⊃_8 ⊏_8 ⊐_8 ⊏_9 ⊐_9
G_1: ⊂_8 ⊃_8 ⊂_2 ⊃_2 ⊂_4 ⊃_4 ⊂_6 ⊃_6   ⊏_9 ⊐_9 ⊏_8 ⊐_8 ··· ⊏_1 ⊐_1
G_2: ⊏_9 ⊐_9 ⊏_8 ⊐_8 ··· ⊏_1 ⊐_1   ⊂_8 ⊃_8 ⊂_4 ⊃_4 ⊂_6 ⊃_6 ⊂_2 ⊃_2

Figure 2: (a) The graph G: E_1 is a single (solid) path (1, 2, 3, 4, 5, 6), E_2 consists of two (dotted) paths (1, 7, 8, 3)
and (4, 9, 6), V_1 = {7, 8, 9}, V_2 = {2, 5}. (b) The three genomic maps G_0, G_1, G_2. (c) The three subsequences of the
genomic maps corresponding to the independent set {2, 4, 6, 8} in the graph.

It is clear that Proposition 1 still holds. The following lemma on canonical subsequences is analogous
to Lemma 1.

Lemma 4. If the three genomic maps G_0, G_1, G_2 have three subsequences of total strip length l, then they
must have three subsequences of total strip length at least l such that (i) each strip is either a pair of vertex
markers or a pair of dummy markers, and (ii) each pair of dummy markers is a strip.

Proof. We present an algorithm that transforms the subsequences into canonical form without reducing the
total strip length. By construction, a strip cannot include both a dummy marker and a vertex marker because
they appear in different orders in G_1 and in G_2, and a strip cannot include two dummy markers of different
indices because they appear in different orders in G_0 and in G_1 and G_2. Suppose that a strip S consists of
vertex markers of two or more different indices. Then there must be two vertex markers μ and ν of different
indices i and j that are consecutive in S. Since the vertex markers and the dummy markers appear in G_0 in
an alternating pattern with ascending indices, we must have i < j. Moreover, the pair of dummy markers
of index i, which appears between μ and ν in G_0, must be missing from the subsequences. Now cut the
strip S into S_μ and S_ν between μ and ν. If S_μ (resp. S_ν) consists of only one marker μ (resp. ν), delete the
lone marker from the subsequences (recall that a strip must include at least two markers). This decreases
the total strip length by at most two. Next insert the pair of dummy markers of index i to the subsequences
as a new strip. This increases the total strip length by exactly two. Repeat this operation whenever a strip
contains two vertex markers of different indices and whenever a pair of dummy markers is missing from the
subsequences; then in O(n) steps we obtain three subsequences of total strip length at least l in canonical
form. □

The following lemma, analogous to Lemma 2, establishes the NP-hardness of MSR-3:

Lemma 5. The graph G has an independent set of at least k vertices if and only if the three genomic maps
G_0, G_1, G_2 have three subsequences whose total strip length l is at least 2(n + k).

Proof. We first prove the "only if" direction. Suppose that the graph G has an independent set of at least k
vertices. We will show that the three genomic maps G_0, G_1, G_2 have three subsequences of total strip length
at least 2(n + k). By Proposition 1, the k vertices in the independent set correspond to k pairs of vertex
markers that do not intersect each other in the genomic maps. These k pairs of vertex markers together with
the n pairs of dummy markers induce a subsequence of length 2(n + k) in each genomic map. In each
subsequence, the left marker and the right marker of each pair appear consecutively and compose a strip.
Thus the total strip length is at least 2(n + k). We refer to Figure 2(c) for an example.

We next prove the "if" direction. Suppose that the three genomic maps G_0, G_1, G_2 have three
subsequences of total strip length at least 2(n + k). We will show that the graph G has an independent set of at
least k vertices. By Lemma 4, the three genomic maps have three subsequences of total strip length at least
2(n + k) such that each strip is a pair of markers. Excluding the n pairs of dummy markers, we obtain at
least k pairs of vertex markers that do not intersect each other in the genomic maps. Then, by Proposition 1,
the corresponding set of at least k vertices in the graph G form an independent set. □

4.2 L-reduction from Max-IS-3 to MSR-3 

We present an L-reduction (f, g, α, β) from Max-IS-3 to MSR-3 as follows. The function f, given a graph
G of maximum degree 3, constructs the three genomic maps G_0, G_1, G_2 as in the NP-hardness reduction.
Let k* be the number of vertices in a maximum independent set in G, and let l* be the maximum total
strip length of any three subsequences of G_0, G_1, G_2, respectively. Since a simple greedy algorithm (which
repeatedly selects a vertex not adjacent to the previously selected vertices) finds an independent set of at
least n/(3 + 1) vertices in the graph G of maximum degree 3, we have k* ≥ n/(3 + 1). By Lemma 5, we
have l* = 2(n + k*). It follows that

\[ l^* = 2(n + k^*) \le 2((3 + 1)k^* + k^*) = 2(3 + 2)k^* = 10k^*. \]

Choose α = 10, then property (1) of L-reduction is satisfied.
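
The greedy argument behind k* ≥ n/(3 + 1) can be sketched in Python as follows. The function name `greedy_independent_set` and the adjacency-dictionary encoding are hypothetical; the sample graph is the one described in the caption of Figure 2. Every selected vertex blocks itself and at most three neighbors, so at least n/4 vertices are selected.

```python
def greedy_independent_set(adj):
    """Repeatedly pick a vertex that is not adjacent to any previously picked vertex."""
    chosen, blocked = [], set()
    for v in adj:
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked.update(adj[v])  # neighbors can no longer be picked
    return chosen

# The example graph: paths (1,...,6), (1,7,8,3), and (4,9,6).
EDGE_LIST = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
             (1, 7), (7, 8), (8, 3), (4, 9), (9, 6)]
ADJ = {v: set() for v in range(1, 10)}
for u, v in EDGE_LIST:
    ADJ[u].add(v)
    ADJ[v].add(u)

independent = greedy_independent_set(ADJ)
print(independent)  # [1, 3, 5, 9]
```

With n = 9 and an independent set of 4 vertices, the bound of this section reads 2(9 + 4) = 26 ≤ 10 · 4 = 40.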

The function g, given three subsequences of the three genomic maps G_0, G_1, G_2, respectively,
transforms the subsequences into canonical form as in the proof of Lemma 4, then returns an independent set of
vertices in the graph G corresponding to the pairs of vertex markers that are strips of the subsequences. Let
l be the total strip length of the subsequences, and let k be the number of vertices in the independent set
returned by the function g. Then k ≥ l/2 − n. It follows that

\[ |k^* - k| = k^* - k \le (l^*/2 - n) - (l/2 - n) = |l^* - l|/2. \]

Choose β = 1/2, then property (2) of L-reduction is also satisfied.

We have obtained an L-reduction from Max-IS-3 to MSR-3 with αβ = 5. Chlebík and Chlebíková [12]
showed that Max-IS-3 is NP-hard to approximate within 1.010661 = 1/(1 − (1 − 1/1.010661)). It follows that
MSR-3 is NP-hard to approximate within 1/(1 − (1 − 1/1.010661)/5) = 1.002114....

5 MSR-2 is APX-hard 

In this section, we prove that MSR-2 is APX-hard by an L-reduction from Ep-Occ-Max-Eq-SAT with p = 3
and q ≥ 2.

5.1 NP-hardness reduction from Ep-Occ-Max-Eq-SAT to MSR-2

Let (X, C) be an instance of Ep-Occ-Max-Eq-SAT, where X is a set of n variables x_i, 1 ≤ i ≤ n, and C is
a set of m clauses C_j, 1 ≤ j ≤ m. Without loss of generality, assume that the p literals of each variable
are neither all positive nor all negative. Since p = 3, it follows that each variable has either 2 positive and 1
negative literals, or 1 positive and 2 negative literals.

We construct two genomic maps G_1 and G_2, each map a permutation of 2(5n + m + qm + 2) distinct
markers all in positive orientation:

• 1 pair of variable markers <_i >_i for each variable x_i, 1 ≤ i ≤ n;

• 2 pairs of true markers ◄_{i,1} ►_{i,1} and ◄_{i,2} ►_{i,2} for each variable x_i, 1 ≤ i ≤ n;

• 2 pairs of false markers ◁_{i,1} ▷_{i,1} and ◁_{i,2} ▷_{i,2} for each variable x_i, 1 ≤ i ≤ n;

• 1 pair of clause markers ⋐_j ⋑_j for each clause C_j, 1 ≤ j ≤ m;

• q pairs of literal markers ⊂_{j,t} ⊃_{j,t}, 1 ≤ t ≤ q, for each clause C_j, 1 ≤ j ≤ m;

• 2 pairs of dummy markers ⊏_1 ⊐_1 and ⊏_2 ⊐_2.
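
As a check on the stated total of 2(5n + m + qm + 2) markers, the number of pairs contributed by the six marker types can be summed as follows. The instance size in the second line is a hypothetical example chosen to satisfy np = mq with p = 3 and q = 2.

```latex
\[
\underbrace{n}_{\text{variable}} + \underbrace{2n}_{\text{true}} + \underbrace{2n}_{\text{false}}
+ \underbrace{m}_{\text{clause}} + \underbrace{qm}_{\text{literal}} + \underbrace{2}_{\text{dummy}}
= 5n + m + qm + 2 \ \text{pairs},
\]
\[
n = 2,\ m = 3,\ q = 2: \qquad 2\,(5 \cdot 2 + 3 + 2 \cdot 3 + 2) = 2 \cdot 21 = 42 \ \text{markers}.
\]
```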

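The construction is done in two steps: first arrange the variable markers, the true/false markers, the
clause markers, and the dummy markers into two sequences $\hat G_1$ and $\hat G_2$, next insert the literal markers at
appropriate positions in the two sequences to obtain the two genomic maps $G_1$ and $G_2$.

The two sequences $\hat G_1$ and $\hat G_2$ are represented schematically as follows:

\[ \hat G_1:\quad (x_1)\ \cdots\ (x_n)\quad \sqsubset^{1}\sqsupset^{1}\ \sqsubset^{2}\sqsupset^{2}\quad \Subset^{1}\Supset^{1}\ \cdots\ \Subset^{m}\Supset^{m}\quad \langle^{1}\rangle^{1}\ \cdots\ \langle^{n}\rangle^{n} \]

\[ \hat G_2:\quad (x_n)\ \cdots\ (x_1)\quad \Subset^{m}\Supset^{m}\ \cdots\ \Subset^{1}\Supset^{1}\quad \sqsubset^{2}\sqsupset^{2}\ \sqsubset^{1}\sqsupset^{1} \]

For each variable $x_i$, $(x_i)$ consists of the corresponding four pairs of true/false markers $\blacktriangleleft^{i,1}\blacktriangleright^{i,1}$, $\blacktriangleleft^{i,2}\blacktriangleright^{i,2}$, $\triangleleft^{i,1}\triangleright^{i,1}$,
$\triangleleft^{i,2}\triangleright^{i,2}$ in $\hat G_1$ and $\hat G_2$, and in addition the pair of variable markers $\langle^{i}\rangle^{i}$ in $\hat G_2$. These markers are arranged in
the two sequences in a special pattern as follows (the indices $i$ are omitted for simpler notations):

\[ \hat G_1:\quad \triangleleft^{1}\ \blacktriangleleft^{2}\ \triangleright^{1}\ \blacktriangleright^{2}\quad \triangleleft^{2}\ \blacktriangleleft^{1}\ \triangleright^{2}\ \blacktriangleright^{1} \]

\[ \hat G_2:\quad \blacktriangleleft^{1}\ \triangleleft^{1}\ \blacktriangleright^{1}\ \triangleright^{1}\quad \langle\ \rangle\quad \triangleleft^{2}\ \blacktriangleleft^{2}\ \triangleright^{2}\ \blacktriangleright^{2} \]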

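Now insert the literal markers into the two sequences $\hat G_1$ and $\hat G_2$ to obtain the two genomic maps $G_1$
and $G_2$. First, $\hat G_1 \to G_1$. For each positive literal (resp. negative literal) of a variable $x_i$ that occurs in a
clause $C_j$, place a pair of literal markers $\subset^{j,t}\ \supset^{j,t}$, $1 \le t \le q$, around a false marker $\triangleleft^{i,s}$ (resp. true marker $\blacktriangleright^{i,s}$),
$1 \le s \le 2$. The four possible positions of the three pairs of literal markers of each variable $x_i$ are as follows:

\[ G_1:\quad \subset\triangleleft^{1}\supset\ \ \blacktriangleleft^{2}\ \ \triangleright^{1}\ \ \subset\blacktriangleright^{2}\supset\quad \subset\triangleleft^{2}\supset\ \ \blacktriangleleft^{1}\ \ \triangleright^{2}\ \ \subset\blacktriangleright^{1}\supset \]

\[ \hat G_2:\quad \blacktriangleleft^{1}\ \triangleleft^{1}\ \blacktriangleright^{1}\ \triangleright^{1}\quad \langle\ \rangle\quad \triangleleft^{2}\ \blacktriangleleft^{2}\ \triangleright^{2}\ \blacktriangleright^{2} \]

Next, $\hat G_2 \to G_2$. Without loss of generality, assume that the $q$ pairs of literal markers of each clause $C_j$
appear in $G_1$ with ascending indices:

\[ \subset^{j,1}\ \supset^{j,1}\ \cdots\ \subset^{j,q}\ \supset^{j,q} \]

Insert the $q$ pairs of literal markers in $\hat G_2$ immediately after the pair of clause markers $\Subset^{j}\ \Supset^{j}$, in an interleaving
pattern:

\[ \subset^{j,q}\ \cdots\ \subset^{j,1}\ \ \supset^{j,q}\ \cdots\ \supset^{j,1} \]

This completes the construction. We refer to Figure 3(a) and (b) for an example of the two steps.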
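The two-step construction can be sketched in code. The following is a minimal illustration, not the paper's own implementation: the tuple encoding of markers, the function name `build_msr2_maps`, and the signed-integer encoding of literals are all assumptions made here, and the block patterns for each variable follow the schematic patterns as read from the (partly garbled) source, so they should be checked against the original figure.

```python
from typing import List, Tuple

def build_msr2_maps(n: int, clauses: List[List[int]]) -> Tuple[list, list]:
    """Build (G1, G2) for an Ep-Occ-Max-Eq-SAT instance with p = 3.

    Hypothetical encoding: a literal is +i for x_i and -i for its negation.
    Markers are tuples such as ("T", i, s, "L") (left true marker of x_i,
    index s), ("F", ...), ("lit", j, t, side), ("clause", j, side),
    ("var", i, side) and ("dummy", s, side).
    """
    m = len(clauses)
    # Markers of x_i that a literal pair may surround in G1, left to right:
    # false markers for positive literals, true markers for negative ones.
    pos_slots = {i: [("F", i, 1, "L"), ("F", i, 2, "L")] for i in range(1, n + 1)}
    neg_slots = {i: [("T", i, 2, "R"), ("T", i, 1, "R")] for i in range(1, n + 1)}
    around = {}  # surrounded marker -> (j, t) of the literal pair
    for j, clause in enumerate(clauses, start=1):
        # Sorting by variable realizes the "ascending indices in G1" assumption.
        for t, lit in enumerate(sorted(clause, key=abs), start=1):
            slots = pos_slots if lit > 0 else neg_slots
            around[slots[abs(lit)].pop(0)] = (j, t)

    def dummies(s: int) -> list:
        return [("dummy", s, "L"), ("dummy", s, "R")]

    g1 = []
    for i in range(1, n + 1):
        block = [("F", i, 1, "L"), ("T", i, 2, "L"), ("F", i, 1, "R"), ("T", i, 2, "R"),
                 ("F", i, 2, "L"), ("T", i, 1, "L"), ("F", i, 2, "R"), ("T", i, 1, "R")]
        for mk in block:
            if mk in around:
                j, t = around[mk]
                g1 += [("lit", j, t, "L"), mk, ("lit", j, t, "R")]
            else:
                g1.append(mk)
    g1 += dummies(1) + dummies(2)
    for j in range(1, m + 1):
        g1 += [("clause", j, "L"), ("clause", j, "R")]
    for i in range(1, n + 1):
        g1 += [("var", i, "L"), ("var", i, "R")]

    g2 = []
    for i in range(n, 0, -1):
        g2 += [("T", i, 1, "L"), ("F", i, 1, "L"), ("T", i, 1, "R"), ("F", i, 1, "R"),
               ("var", i, "L"), ("var", i, "R"),
               ("F", i, 2, "L"), ("T", i, 2, "L"), ("F", i, 2, "R"), ("T", i, 2, "R")]
    for j in range(m, 0, -1):
        q = len(clauses[j - 1])
        g2 += [("clause", j, "L"), ("clause", j, "R")]
        # Interleaving pattern: all left markers, then all right markers,
        # both with descending indices.
        g2 += [("lit", j, t, "L") for t in range(q, 0, -1)]
        g2 += [("lit", j, t, "R") for t in range(q, 0, -1)]
    g2 += dummies(2) + dummies(1)
    return g1, g2
```

For the instance of Figure 3 (two variables, three clauses of two literals each), calling `build_msr2_maps(2, [[1, 2], [1, -2], [-1, -2]])` should give two permutations of the same $2(5n + m + qm + 2) = 42$ markers.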



Figure 3: MSR-2 construction for the E3-Occ-Max-E2-SAT instance $C_1 = x_1 \lor x_2$, $C_2 = x_1 \lor \bar{x}_2$, and $C_3 = \bar{x}_1 \lor \bar{x}_2$.
(a) The two sequences $\hat G_1$ and $\hat G_2$. (b) The two genomic maps $G_1$ and $G_2$. (c) Two canonical subsequences for the
assignment $x_1$ = true and $x_2$ = false. (d) Two other canonical subsequences for the assignment $x_1$ = true and
$x_2$ = false.

We say that two subsequences of the two genomic maps $G_1$ and $G_2$ are canonical if each strip of the
two subsequences is a pair of markers. We refer to Figure 3(c) and (d) for two examples of canonical
subsequences. The following lemma on canonical subsequences is analogous to Lemma 1 and Lemma 4.

Lemma 6. If the two genomic maps $G_1$ and $G_2$ have two subsequences of total strip length $l$, then they
must have two subsequences of total strip length at least $l$ such that each strip is a pair of markers and,
moreover, (i) the two pairs of dummy markers are two strips, (ii) the $m$ pairs of clause markers and the $n$
pairs of variable markers are $m + n$ strips, (iii) at most one pair of literal markers of each clause is a strip,
(iv) either both pairs of true markers or both pairs of false markers of each variable are two strips.

Proof. We present an algorithm that transforms the subsequences into canonical form without reducing
the total strip length. The algorithm performs incremental operations on the subsequences such that the
following eight conditions are satisfied progressively:

1. Each strip that includes a dummy marker is a pair of dummy markers. A strip cannot include two
dummy markers of different indices because they appear in different orders in $G_1$ and in $G_2$. Note that in $G_2$
the dummy markers appear after the other markers. Suppose that a strip $S$ includes both a dummy marker
and a non-dummy marker. Then there must be a non-dummy marker $\mu$ and a dummy marker $\nu$ consecutive
in $S$. Since the two pairs of dummy markers appear consecutively but in different orders in $G_1$ and in $G_2$,
one of the two pairs must appear between $\mu$ and $\nu$ either in $G_1$ or in $G_2$. This pair is hence missing from
the subsequences. Now cut the strip $S$ into $S_\mu$ and $S_\nu$ between $\mu$ and $\nu$. If $S_\mu$ (resp. $S_\nu$) consists of only
one marker $\mu$ (resp. $\nu$), delete the lone marker from the subsequences (recall that a strip must include at
least two markers). This decreases the total strip length by at most two. Next insert the missing pair of
dummy markers into the subsequences. This pair of dummy markers becomes either a new strip by itself, or
part of a longer strip (recall that a strip must be maximal). In any case, the insertion increases the total strip
length by exactly two. Overall, this cut-delete-insert operation (also used in Lemma 4) does not reduce the
total strip length. After the first operation, a second operation may be necessary. But since each operation
here deletes only lone markers (in $S_\mu$ and $S_\nu$) and always inserts a pair of markers, the pair inserted by one
operation is never deleted by a subsequent operation. Thus at most two operations are sufficient to transform
the subsequences until each strip that includes a dummy marker is indeed a pair of dummy markers.

2. The two pairs of dummy markers are two strips. Suppose that the subsequences do not have both
pairs of dummy markers as strips. Then, by condition 1, we must have either both pairs of dummy markers
missing from the subsequences, or one pair missing and the other pair forming a strip. Note that in $G_1$ the
dummy markers separate the true/false and literal markers on the left from the clause and variable markers
on the right, and that in $G_2$ the dummy markers appear after the other markers. If the missing dummy
markers do not disrupt any existing strips in $G_1$, then simply insert each missing pair into the subsequences
as a new strip. Otherwise, there must be a true/false or literal marker $\mu$ and a clause or variable marker $\nu$
consecutive in a strip $S$, such that both pairs of dummy markers appear in $G_1$ between $\mu$ and $\nu$ and hence
are missing from the subsequences. Cut the strip $S$ between $\mu$ and $\nu$, delete any lone markers if necessary,
then insert the two pairs of dummy markers into the subsequences as two new strips.

3. Each strip that includes a clause or variable marker is a pair of clause markers or a pair of variable
markers. Note that in $G_1$ the clause and variable markers are separated by the dummy markers from the other
markers. Thus, by condition 2, a strip that includes a clause or variable marker cannot include any markers
of the other types. Also, a strip cannot include two clause markers of different clauses, or two variable
markers of different variables, or a clause marker and a variable marker, because these combinations appear
in different orders in $G_1$ and in $G_2$. Thus this condition is automatically satisfied after conditions 1 and 2.

4. The $m$ pairs of clause markers and the $n$ pairs of variable markers are $m + n$ strips. Suppose that
the subsequences do not have all $m + n$ pairs of clause and variable markers as $m + n$ strips. By condition 3,
the clause and variable markers in the subsequences must be in pairs, each pair forming a strip. Then the
clause and variable markers missing from the subsequences must be in pairs too. For each missing pair of
clause or variable markers, if the pair does not disrupt any existing strips in $G_2$, then simply insert it into the
subsequences as a new strip. Otherwise, there must be two true/false or literal markers $\mu$ and $\nu$ consecutive
in a strip $S$, such that the missing pair appears in $G_2$ between $\mu$ and $\nu$. Cut the strip $S$ between $\mu$ and $\nu$,
delete any lone markers if necessary, then insert each missing pair of clause or variable markers between $\mu$ and $\nu$ into the
subsequences as a new strip.

5. Each strip that includes a literal marker is a pair of literal markers. Note that in $G_2$ the dummy
and clause markers separate the literal markers from the other markers, and separate the literal markers of
different clauses from each other. Thus, by conditions 2 and 4, a strip cannot include both a literal marker
and a non-literal marker, or two literal markers of different clauses. Suppose that a strip $S$ includes two
literal markers $\mu$ and $\nu$ of the same clause $C_j$ but of different indices $j,s$ and $j,t$. Assume without loss of
generality that $\mu$ and $\nu$ are consecutive in $S$. Recall the orders of the literal markers of each clause in the
two genomic maps:

\[ G_1:\quad \subset^{j,1}\ \supset^{j,1}\ \cdots\ \subset^{j,s}\ \supset^{j,s}\ \cdots\ \subset^{j,t}\ \supset^{j,t}\ \cdots\ \subset^{j,q}\ \supset^{j,q} \]

\[ G_2:\quad \subset^{j,q}\ \cdots\ \subset^{j,t}\ \cdots\ \subset^{j,s}\ \cdots\ \subset^{j,1}\ \ \supset^{j,q}\ \cdots\ \supset^{j,t}\ \cdots\ \supset^{j,s}\ \cdots\ \supset^{j,1} \]

Since in $G_1$ the pairs of literal markers appear with ascending indices, the index $s$ of the marker $\mu$ must
be less than the index $t$ of the marker $\nu$. Then, since in $G_2$ the left markers appear with descending indices
before the right markers also with descending indices, $\mu$ must be a left marker, and $\nu$ must be a right marker.
That is, $\mu\nu = \subset^{j,s}\ \supset^{j,t}$. All markers between $\mu$ and $\nu$ in $G_1$ must be missing from the subsequences. Among
these missing markers, those that are literal markers of $C_j$ appear in $G_2$ either consecutively before $\mu$ or
consecutively after $\nu$. Replace either $\mu$ or $\nu$ by a missing literal marker of $C_j$, that is, either $\subset^{j,s}$ by $\subset^{j,t}$, or $\supset^{j,t}$
by $\supset^{j,s}$, then $\mu$ and $\nu$ become a pair. Denote this shift operation by

\[ \mu\nu:\quad \subset^{j,s}\ \supset^{j,t}\ \to\ \subset^{j,t}\ \supset^{j,t}\ \text{ or }\ \subset^{j,s}\ \supset^{j,s}. \]

The strip $S$ cannot include any other literal markers of the clause $C_j$ besides $\mu$ and $\nu$ because (i) the markers
before $\subset^{j,s}$ in $G_1$ appear after $\subset^{j,s}$ in $G_2$, and (ii) the markers after $\supset^{j,t}$ in $G_1$ appear before $\supset^{j,t}$ in $G_2$.

6. At most one pair of literal markers of each clause is a strip. Note that the q pairs of literal markers 
of each clause appear in G2 in an interleaving pattern. It follows by condition 5 that at most one of the q 
pairs can be a strip. 

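7. Each strip that includes a true/false marker is a pair of true markers or a pair of false markers.
By conditions 1, 3, and 5, it follows that each strip that includes a true/false marker must include true/false
markers only. A strip cannot include two true/false markers of different variables because they appear in
different orders in $G_1$ and in $G_2$. Suppose that a strip $S$ includes two true/false markers $\mu$ and $\nu$ of the same
variable $x_i$ such that $\mu$ and $\nu$ are not a pair. Recall the orders of the four pairs of true/false markers of each
variable $x_i$ in $G_1$ and $G_2$, the four possible positions of the three pairs of literal markers in $G_1$, and the
position of the pair of variable markers in $G_2$:

\[ G_1:\quad \subset\triangleleft^{1}\supset\ \ \blacktriangleleft^{2}\ \ \triangleright^{1}\ \ \subset\blacktriangleright^{2}\supset\quad \subset\triangleleft^{2}\supset\ \ \blacktriangleleft^{1}\ \ \triangleright^{2}\ \ \subset\blacktriangleright^{1}\supset \]

\[ G_2:\quad \blacktriangleleft^{1}\ \triangleleft^{1}\ \blacktriangleright^{1}\ \triangleright^{1}\quad \langle\ \rangle\quad \triangleleft^{2}\ \blacktriangleleft^{2}\ \triangleright^{2}\ \blacktriangleright^{2} \]

Note that the pair of variable markers in $G_2$ forbids a strip from including two true/false markers of
different indices. Thus the strip $S$ must consist of true/false markers of both the same variable and the same
index. Assume without loss of generality that $\mu$ appears before $\nu$ in $S$. It is easy to check that there are only
two such combinations of $\mu$ and $\nu$: either $\mu\nu = \triangleleft^{1}\ \blacktriangleright^{1}$ or $\mu\nu = \blacktriangleleft^{2}\ \triangleright^{2}$. Moreover, the strip $S$ must include only
the two markers $\mu$ and $\nu$. For either combination of $\mu$ and $\nu$, use a shift operation to make $\mu$ and $\nu$ a pair:

\[ \mu\nu:\quad \triangleleft^{1}\ \blacktriangleright^{1}\ \to\ \triangleleft^{1}\ \triangleright^{1}\ \text{ or }\ \blacktriangleleft^{1}\ \blacktriangleright^{1}, \qquad \blacktriangleleft^{2}\ \triangleright^{2}\ \to\ \blacktriangleleft^{2}\ \blacktriangleright^{2}\ \text{ or }\ \triangleleft^{2}\ \triangleright^{2}. \]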
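The "easy to check" step in condition 7 can be verified mechanically. The sketch below is only an illustration under stated assumptions: the string labels (`"T1L"` for the left true marker of index 1, `"F2R"` for the right false marker of index 2, and so on) are hypothetical, and the two block orders encode the per-variable patterns of $G_1$ and $G_2$ as read from the source's schematic, with the literal markers left out because they do not affect the relative order of true/false markers.

```python
from itertools import combinations

# Per-variable order of the true/false markers in G1 (literal markers omitted).
G1_BLOCK = ["F1L", "T2L", "F1R", "T2R", "F2L", "T1L", "F2R", "T1R"]
# Per-variable order in G2; "VL" and "VR" stand for the pair of variable markers.
G2_BLOCK = ["T1L", "F1L", "T1R", "F1R", "VL", "VR", "F2L", "T2L", "F2R", "T2R"]

def same_order_non_pairs():
    """Return the non-pair combinations (mu, nu) of true/false markers with the
    same index that appear in the same relative order in both blocks."""
    pos2 = {mk: k for k, mk in enumerate(G2_BLOCK)}
    found = []
    # combinations() preserves list order, so mu precedes nu in G1.
    for mu, nu in combinations(G1_BLOCK, 2):
        same_index = mu[1] == nu[1]
        is_pair = mu[:2] == nu[:2]  # same type (T/F) and same index
        if same_index and not is_pair and pos2[mu] < pos2[nu]:
            found.append((mu, nu))
    return found
```

Under these assumptions the function returns exactly two combinations, a left false marker followed by a right true marker for index 1, and a left true marker followed by a right false marker for index 2, which are the two cases handled by the shift operation.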

8. Either both pairs of true markers or both pairs of false markers of each variable are two strips.
Consider the conflict graph of the four pairs of true/false markers and the three pairs of literal markers of
each variable $x_i$ in Figure 4. The graph has one vertex for each pair, and has an edge between two vertices if
and only if the corresponding pairs intersect in either $G_1$ or $G_2$. By conditions 1, 3, 5, and 7, the strips of the
subsequences from the seven pairs correspond to an independent set in the conflict graph of seven vertices.

Note that the four vertices corresponding to the four pairs of true/false markers induce a 4-cycle in the
conflict graph. Suppose that neither both pairs of true markers nor both pairs of false markers are strips.
Then at most one of the four pairs, say $S$, is a strip. Delete $S$ from the subsequences. Recall that each
variable has either 2 positive and 1 negative literals, or 1 positive and 2 negative literals. Let $T$ be the pair
of literal markers whose sign is opposite to the sign of the other two pairs of literal markers. Also delete $T$
from the subsequences if it is there. Next insert two pairs of true/false markers into the subsequences: if $T$ is
positive, both pairs of false markers $\triangleleft^{i,1}\ \triangleright^{i,1}$ and $\triangleleft^{i,2}\ \triangleright^{i,2}$; if $T$ is negative, both pairs of true markers $\blacktriangleleft^{i,1}\ \blacktriangleright^{i,1}$ and
$\blacktriangleleft^{i,2}\ \blacktriangleright^{i,2}$.

Figure 4: Replacing vertices of the independent set in the conflict graph of the four pairs of true/false markers and the
three pairs of literal markers of each variable. Vertices in the independent set are black. Edges in the 4-cycle are thick.
In this example the strip $S$ is first deleted then inserted back.

When all eight conditions are satisfied, the subsequences are in the desired canonical form. □ 

The following lemma, analogous to Lemma 2 and Lemma 5, establishes the NP-hardness of MSR-2:

Lemma 7. The variables in $X$ have an assignment that satisfies at least $k$ clauses in $C$ if and only if the two
genomic maps $G_1$ and $G_2$ have two subsequences whose total strip length $l$ is at least $2(3n + m + k + 2)$.

Proof. We first prove the "only if" direction. Suppose that the variables in $X$ have an assignment that
satisfies at least $k$ clauses in $C$. We will show that the two genomic maps $G_1$ and $G_2$ have two subsequences
of total strip length at least $2(3n + m + k + 2)$. For each variable $x_i$, choose the two pairs of true markers
if the variable is assigned true, or the two pairs of false markers if the variable is assigned false. For each
satisfied clause $C_j$, choose one pair of literal markers corresponding to a true literal (when there are two or
more true literals, choose any one). Also choose all $m + n$ pairs of clause and variable markers and both
pairs of dummy markers. The chosen markers induce two subsequences of the two genomic maps. It is easy
to check that, by construction, the two subsequences have at least $3n + m + k + 2$ strips, each strip forming
a pair. Thus the total strip length is at least $2(3n + m + k + 2)$. We refer to Figure 3(c) and (d) for two
examples.

We next prove the "if" direction. Suppose that the two genomic maps $G_1$ and $G_2$ have two subsequences
of total strip length at least $2(3n + m + k + 2)$. We will show that the variables in $X$ have an assignment that
satisfies at least $k$ clauses in $C$. By Lemma 6, the two genomic maps have two subsequences of total strip
length at least $2(3n + m + k + 2)$ such that each strip is a pair and, moreover, the two pairs of dummy markers,
the $m + n$ pairs of clause and variable markers, at most one pair of literal markers of each clause, and either
both pairs of true markers or both pairs of false markers of each variable are strips. Thus at least $k$ strips are
pairs of literal markers, each pair of a different clause. Again it is easy to check that, by construction, the
assignment of the variables in $X$ to either true or false (corresponding to the choices of either both pairs of
true markers or both pairs of false markers) satisfies at least $k$ clauses in $C$ (corresponding to the at least $k$
pairs of literal markers that are strips). □

5.2 L-reduction from E$p$-Occ-Max-E$q$-SAT to MSR-2

We present an L-reduction $(f, g, \alpha, \beta)$ from E$p$-Occ-Max-E$q$-SAT to MSR-2 as follows. The function $f$,
given the E$p$-Occ-Max-E$q$-SAT instance $(X, C)$, constructs the two genomic maps $G_1$ and $G_2$ as in the NP-hardness
reduction. Let $k^*$ be the maximum number of clauses in $C$ that can be satisfied by an assignment of
$X$, and let $l^*$ be the maximum total strip length of any two subsequences of $G_1$ and $G_2$, respectively. Since
a random assignment of each variable independently to either true or false with equal probability $\frac{1}{2}$ satisfies
each disjunctive clause of $q$ literals with probability $1 - \frac{1}{2^q}$, we have $k^* \ge \frac{2^q - 1}{2^q} m$. By Lemma 7, we have
$l^* = 2(3n + m + k^* + 2)$. Recall that $np = mq$. It follows that

\[ l^* = 2(3n + m + k^* + 2) = \left(6\frac{q}{p} + 2\right) m + 2k^* + 4 \le \left(\left(6\frac{q}{p} + 2\right)\frac{2^q}{2^q - 1} + 2 + \frac{4}{k^*}\right) k^*. \]

The function $g$, given two subsequences of the two genomic maps $G_1$ and $G_2$, respectively, transforms
the subsequences into canonical form as in the proof of Lemma 6, then returns an assignment of $X$ corresponding
to the choices of true or false markers. Let $l$ be the total strip length of the subsequences, and let $k$
be the number of clauses in $C$ that are satisfied by this assignment. Then $k \ge l/2 - 3n - m - 2$. It follows
that

\[ |k^* - k| = k^* - k \le (l^*/2 - 3n - m - 2) - (l/2 - 3n - m - 2) = |l^* - l|/2. \]

Let $\epsilon > 0$ be an arbitrarily small constant. Note that by brute force we can check whether $k^* < 2/\epsilon$ and,
in the affirmative case, compute an optimal assignment of $X$ that satisfies the maximum number of clauses
in $C$, all in $m^{O(1/\epsilon)}$ time, which is polynomial in $m$ for a constant $\epsilon$. Therefore we can assume without loss
of generality that $k^* \ge 2/\epsilon$, so that $4/k^* \le 2\epsilon$. Then, with the two constants $\alpha = \left(6\frac{q}{p} + 2\right)\frac{2^q}{2^q - 1} + 2 + 2\epsilon$ and $\beta = 1/2$, both
properties (1) and (2) of L-reduction are satisfied. In particular, for $p = 3$ and $q = 2$,

\[ \alpha\beta = \left(3\frac{q}{p} + 1\right)\frac{2^q}{2^q - 1} + 1 + \epsilon = 5 + \epsilon. \]

Berman and Karpinski [8] showed that E3-Occ-Max-E2-SAT is NP-hard to approximate within any
constant less than $\frac{464}{463} = \frac{1}{1 - 1/464}$. Thus MSR-2 is NP-hard to approximate within any constant less than

\[ \lim_{\epsilon \to 0} \frac{1}{1 - (1/464)/(5 + \epsilon)} = \frac{1}{1 - 1/2320} = \frac{2320}{2319} = 1.000431\ldots. \]
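The arithmetic behind these constants can be reproduced with exact rational numbers. The sketch below is an illustration only: the function names are hypothetical, and it assumes the way the bound is transferred in the text, namely that hardness within $1/(1 - \delta)$ for the source maximization problem yields hardness within $1/(1 - \delta/(\alpha\beta))$ for the target maximization problem.

```python
from fractions import Fraction

def alpha_msr2(p: int, q: int) -> Fraction:
    """Constant alpha of the reduction to MSR-2, without the 2*epsilon term."""
    return (Fraction(6 * q, p) + 2) * Fraction(2**q, 2**q - 1) + 2

def transferred_bound(delta: Fraction, alpha_beta: Fraction) -> Fraction:
    """Bound 1 / (1 - delta / (alpha * beta)) obtained through the L-reduction."""
    return 1 / (1 - delta / alpha_beta)

# MSR-2: p = 3, q = 2, beta = 1/2, source bound 464/463 = 1 / (1 - 1/464).
alpha_beta = alpha_msr2(3, 2) * Fraction(1, 2)
msr2_bound = transferred_bound(Fraction(1, 464), alpha_beta)
print(alpha_beta, msr2_bound, float(msr2_bound))
```

The same helper applies to the reduction from Max-IS-3 to MSR-3 with $\alpha\beta = 5$ and $\delta = 0.010661/1.010661$.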

6 An asymptotic lower bound for MSR-d 

In this section, we derive an asymptotic lower bound for approximating MSR-$d$ by an L-reduction from
$d$-Dimensional-Matching to MSR-$(d + 2)$.

6.1 NP-hardness reduction from $d$-Dimensional-Matching to MSR-$(d + 2)$

Let $E \subseteq V_1 \times \cdots \times V_d$ be a set of $n$ hyper-edges over $d$ disjoint sets $V_i$ of vertices, $1 \le i \le d$. We construct
two genomic maps $G_\rightarrow$ and $G_\leftarrow$, and $d$ genomic maps $G_i$, $1 \le i \le d$, where each map is a permutation of
the following $2n$ distinct markers all in positive orientation:

• $n$ pairs of edge markers $\subset^{i}$ and $\supset^{i}$, $1 \le i \le n$.

The two genomic maps $G_\rightarrow$ and $G_\leftarrow$ are concatenations of the $n$ pairs of edge markers with ascending
and descending indices, respectively:

\[ G_\rightarrow:\quad \subset^{1}\ \supset^{1}\ \cdots\ \subset^{n}\ \supset^{n} \]

\[ G_\leftarrow:\quad \subset^{n}\ \supset^{n}\ \cdots\ \subset^{1}\ \supset^{1} \]

Each genomic map $G_i$ corresponds to a vertex set $V_i = \{v_{i,j} \mid 1 \le j \le |V_i|\}$, $1 \le i \le d$, and is
represented schematically as follows:

\[ G_i:\quad \cdots\ (v_{i,j})\ \cdots \]
Figure 5: MSR-5 construction for the 3-Dimensional-Matching instance $V_1 = \{v_{1,1}, v_{1,2}\}$, $V_2 = \{v_{2,1}, v_{2,2}\}$, $V_3 =
\{v_{3,1}, v_{3,2}\}$, and $E = \{e_1 = (v_{1,1}, v_{2,2}, v_{3,1}),\ e_2 = (v_{1,2}, v_{2,1}, v_{3,2}),\ e_3 = (v_{1,1}, v_{2,1}, v_{3,2}),\ e_4 = (v_{1,1}, v_{2,2}, v_{3,2})\}$.
(a) The five genomic maps $G_\rightarrow$, $G_\leftarrow$, $G_1$, $G_2$, $G_3$. (b) The five subsequences of the genomic maps corresponding to
the subset $\{e_1, e_2\}$ of pairwise-disjoint hyper-edges.

Here each $(v_{i,j})$ consists of the edge markers of hyper-edges containing the vertex $v_{i,j}$, grouped together
such that the left markers appear with ascending indices before the right markers also with ascending indices.
This completes the construction. We refer to Figure 5(a) for an example.
The following property of our construction is obvious:

Proposition 2. Two hyper-edges in $E$ intersect if and only if the corresponding two pairs of edge markers
intersect in one of the $d$ genomic maps $G_i$, $1 \le i \le d$.

The following lemma is analogous to Lemma 1:

Lemma 8. In any $d + 2$ subsequences of the $d + 2$ genomic maps $G_\rightarrow, G_\leftarrow, G_1, \ldots, G_d$, respectively, each
strip must be a pair of edge markers.

Proof. By construction, a strip cannot include two edge markers of different indices because they appear in
different orders in $G_\rightarrow$ and in $G_\leftarrow$. □

The following lemma, analogous to Lemma 2, Lemma 5, and Lemma 7, establishes the NP-hardness of
MSR-$d$:

Lemma 9. The set $E$ has a subset of $k$ pairwise-disjoint hyper-edges if and only if the $d + 2$ genomic maps
$G_\rightarrow, G_\leftarrow, G_1, \ldots, G_d$ have $d + 2$ subsequences whose total strip length $l$ is at least $2k$.

Proof. We first prove the "only if" direction. Suppose that the set $E$ has a subset of at least $k$ pairwise-disjoint
hyper-edges. We will show that the $d + 2$ genomic maps $G_\rightarrow, G_\leftarrow, G_1, \ldots, G_d$ have $d + 2$ subsequences
of total strip length at least $2k$. By Proposition 2, the $k$ pairwise-disjoint hyper-edges correspond to
$k$ pairs of edge markers that do not intersect each other in the genomic maps. These $k$ pairs of edge markers
induce a subsequence of length $2k$ in each genomic map. In each subsequence, the left marker and the right
marker of each pair appear consecutively and compose a strip. Thus the total strip length is at least $2k$. We
refer to Figure 5(b) for an example.

We next prove the "if" direction. Suppose that the $d + 2$ genomic maps $G_\rightarrow, G_\leftarrow, G_1, \ldots, G_d$ have
$d + 2$ subsequences of total strip length at least $2k$. We will show that the set $E$ has a subset of at least $k$
pairwise-disjoint hyper-edges. By Lemma 8, each strip of the subsequences must be a pair of edge markers.
Thus we obtain at least $k$ pairs of edge markers that do not intersect each other in the genomic maps. Then,
by Proposition 2, the corresponding set of at least $k$ hyper-edges in $E$ are pairwise-disjoint. □
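The construction of the $d + 2$ genomic maps and the correspondence between disjoint hyper-edges and strips can be tried out on small instances. The following is a minimal sketch under stated assumptions: the marker encoding `("L", e)` / `("R", e)` and both function names are hypothetical, and `total_strip_length` relies on the fact that all markers are distinct and in positive orientation, so a strip is simply a maximal run of at least two markers that are consecutive, in the same order, in every subsequence.

```python
def build_matching_maps(hyperedges, sizes):
    """hyperedges: list of d-tuples of 1-based vertex indices (one per set V_i);
    sizes[i] = |V_{i+1}|. Returns the d + 2 maps [G_fwd, G_bwd, G_1, ..., G_d]."""
    n, d = len(hyperedges), len(sizes)
    fwd = [(side, e) for e in range(1, n + 1) for side in "LR"]
    bwd = [(side, e) for e in range(n, 0, -1) for side in "LR"]
    maps = [fwd, bwd]
    for i in range(d):
        g = []
        for v in range(1, sizes[i] + 1):
            # Edges containing vertex v of V_i: left markers, then right markers,
            # both with ascending indices.
            group = [e for e in range(1, n + 1) if hyperedges[e - 1][i] == v]
            g += [("L", e) for e in group] + [("R", e) for e in group]
        maps.append(g)
    return maps

def markers_of(edges):
    """Both markers of every hyper-edge index in `edges`."""
    return {(side, e) for e in edges for side in "LR"}

def total_strip_length(maps, kept):
    """Number of kept markers that lie in a strip of the induced subsequences."""
    subs = [[x for x in g if x in kept] for g in maps]
    succ = [dict(zip(s, s[1:])) for s in subs]  # immediate successor in each map
    in_strip = set()
    for a, b in zip(subs[0], subs[0][1:]):
        if all(sc.get(a) == b for sc in succ):  # a, b adjacent in every map
            in_strip.update((a, b))
    return len(in_strip)
```

On the instance of Figure 5, keeping the markers of the disjoint hyper-edges $e_1$ and $e_2$ gives total strip length 4, while keeping those of the intersecting hyper-edges $e_1$ and $e_4$ gives no strip at all.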
6.2 L-reduction from $d$-Dimensional-Matching to MSR-$(d + 2)$

We present an L-reduction $(f, g, \alpha, \beta)$ from $d$-Dimensional-Matching to MSR-$(d + 2)$ as follows. The function
$f$, given a set $E \subseteq V_1 \times \cdots \times V_d$ of hyper-edges, constructs the $d + 2$ genomic maps $G_\rightarrow, G_\leftarrow, G_1, \ldots, G_d$
as in the NP-hardness reduction. Let $k^*$ be the maximum number of pairwise-disjoint hyper-edges in $E$, and
let $l^*$ be the maximum total strip length of any $d + 2$ subsequences of $G_\rightarrow, G_\leftarrow, G_1, \ldots, G_d$, respectively.
By Lemma 9, we have

\[ l^* = 2k^*. \]

Choose $\alpha = 2$, then property (1) of L-reduction is satisfied.

The function $g$, given $d + 2$ subsequences of the $d + 2$ genomic maps $G_\rightarrow, G_\leftarrow, G_1, \ldots, G_d$, respectively,
returns a subset of pairwise-disjoint hyper-edges in $E$ corresponding to the pairs of edge markers that are
strips of the subsequences. Let $l$ be the total strip length of the subsequences, and let $k$ be the number of
pairwise-disjoint hyper-edges returned by the function $g$. Then $k \ge l/2$. It follows that

\[ |k^* - k| = k^* - k \le l^*/2 - l/2 = |l^* - l|/2. \]

Choose $\beta = 1/2$, then property (2) of L-reduction is also satisfied.

We have obtained an L-reduction from $d$-Dimensional-Matching to MSR-$(d + 2)$ with $\alpha\beta = 1$. Hazan,
Safra, and Schwartz [16] showed that $d$-Dimensional-Matching is NP-hard to approximate within $\Omega(d/\log d)$.
It follows that MSR-$d$ is also NP-hard to approximate within $\Omega(d/\log d)$. This completes the proof of Theorem 1.

7 A polynomial-time $2d$-approximation for MSR-$d$

In this section we prove Theorem 2. We briefly review the two previous algorithms [27, 11] for this problem.
The first algorithm for MSR-2 is a simple heuristic due to Zheng, Zhu, and Sankoff [27]:

1. Extract a set of pre-strips from the two genomic maps;

2. Compute an independent set of strips from the pre-strips.

This algorithm is inefficient because the number of pre-strips could be exponential in the sequence length,
and furthermore the problem Maximum-Weight Independent Set in general graphs is NP-hard.

Chen, Fu, Jiang, and Zhu [11] presented a $2d$-approximation algorithm for MSR-$d$. For any $d \ge 2$, a
$d$-interval is the union of $d$ disjoint intervals in the real line, and a $d$-interval graph is the intersection graph
of a set of $d$-intervals, with a vertex for each $d$-interval, and with an edge between two vertices if and only if
the corresponding $d$-intervals overlap. The $2d$-approximation algorithm [11] works as follows:

1. Compose a set of $d$-intervals, one for each combination of $d$ substrings of the $d$ genomic maps, respectively.
Assign each $d$-interval a weight equal to the length of a longest common subsequence (which
may be reversed and negated) in the corresponding $d$ substrings.

2. Compute a $2d$-approximation for Maximum-Weight Independent Set in the resulting $d$-interval graph
using Bar-Yehuda et al.'s fractional local-ratio algorithm [6].

Let $n$ be the number of markers in each genomic map. Then the number of $d$-intervals composed by this
algorithm is $\Theta(n^{2d})$ because each of the $d$ genomic maps has $\Theta(n^2)$ substrings. Consequently the running
time of this algorithm can be exponential if the number $d$ of genomic maps is not a constant but is part of the
input. In the following, we show that if all markers are distinct in each genomic map (as discussed earlier,
this is a reasonable assumption in applications), then the running time of the $2d$-approximation algorithm
can be improved to polynomial for all $d \ge 2$. This improvement is achieved by composing a smaller set of
candidate $d$-intervals in step 1 of the algorithm.

The idea is actually quite simple and has been used many times previously [21, 19, 10]. Note that
any strip of length $l > 3$ is a concatenation of shorter strips of lengths 2 and 3, for example, $4 = 2 + 2$,
$5 = 2 + 3$, etc. Since the objective is to maximize the total strip length, it suffices to consider only short
strips of lengths 2 and 3 in the genomic maps, and to enumerate only candidate $d$-intervals that correspond
to these strips. When each genomic map is a signed permutation of the same $n$ distinct markers, there are
at most $\binom{n}{2} + \binom{n}{3} = O(n^3)$ strips of lengths 2 and 3, and for each strip there is a unique shortest substring
of each genomic map that contains all markers in the strip. Thus we compose only $O(n^3)$ $d$-intervals, and
improve the running time of the $2d$-approximation algorithm to polynomial for all $d \ge 2$. This completes
the proof of Theorem 2.
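The enumeration of candidate $d$-intervals described above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of [11] itself: the function name and the encoding of each genomic map as a list of nonzero integers (sign for orientation) are hypothetical, and the sketch covers only step 1, leaving the weighted independent set computation of step 2 aside.

```python
from itertools import combinations

def candidate_d_intervals(maps):
    """maps: d signed permutations (lists of nonzero ints) of the same n markers.

    Returns a list of (strip, intervals), where strip is a tuple of 2 or 3 signed
    markers as they appear in maps[0], and intervals[i] = (lo, hi) is the shortest
    substring of maps[i], by 0-based positions, containing all markers of the strip.
    """
    pos = [{abs(x): k for k, x in enumerate(g)} for g in maps]
    markers = sorted(pos[0])
    result = []
    for size in (2, 3):  # longer strips are concatenations of these
        for subset in combinations(markers, size):
            ref, intervals, ok = None, [], True
            for g, p in zip(maps, pos):
                idx = sorted(p[mk] for mk in subset)
                seq = [g[k] for k in idx]
                if ref is None:
                    ref = seq
                # The markers must appear as ref, or reversed and negated.
                elif seq != ref and seq != [-x for x in reversed(ref)]:
                    ok = False
                    break
                intervals.append((idx[0], idx[-1]))
            if ok:
                result.append((tuple(ref), intervals))
    return result
```

Each candidate would receive weight equal to its strip length (2 or 3). At most $\binom{n}{2} + \binom{n}{3}$ subsets are examined, and each is checked against all $d$ maps, so the enumeration is polynomial in both $n$ and $d$.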

8 Inapproximability results for related problems 

In this section we prove Theorem 3 and Theorem 4.

CMSR-3 and CMSR-4 are APX-hard. For any $d$, the decision problems of MSR-$d$ and CMSR-$d$ are
equivalent. Thus the NP-hardness of MSR-$d$ implies the NP-hardness of CMSR-$d$, although the APX-hardness
of MSR-$d$ does not necessarily imply the APX-hardness of CMSR-$d$. Note that the two problems
Max-IS-$\Delta$ and Min-VC-$\Delta$ complement each other just as the two problems MSR-$d$ and CMSR-$d$ complement
each other. Thus our NP-hardness reduction from Max-IS-3 to MSR-3 in Section 4 can be immediately
turned into an NP-hardness reduction from Min-VC-3 to CMSR-3.

We present an L-reduction $(f, g, \alpha, \beta)$ from Min-VC-3 to CMSR-3 as follows. The function $f$, given
a graph $G$ of maximum degree 3, constructs the three genomic maps $G_0, G_1, G_2$ as in the NP-hardness
reduction in Section 4. Let $k^*$ be the number of vertices in a maximum independent set in $G$, and let $l^*$
be the maximum total strip length of any three subsequences of $G_0, G_1, G_2$, respectively. Also let $c^*$ be
the number of vertices in a minimum vertex cover in $G$, and let $x^*$ be the minimum number of markers
that must be deleted to transform the three genomic maps $G_0, G_1, G_2$ into strip-concatenated subsequences.
Then $k^* + c^* = n$ and $l^* + x^* = 4n$. By Lemma 5, we have $l^* = 2(n + k^*)$. It follows that

\[ x^* = 4n - l^* = 4n - 2(n + k^*) = 2(n - k^*) = 2c^*. \]

Choose $\alpha = 2$, then property (1) of L-reduction is satisfied.

The function $g$, given three subsequences of the three genomic maps $G_0, G_1, G_2$, respectively, transforms
the subsequences into canonical form as in the proof of Lemma 4, then returns a vertex cover in the
graph $G$ corresponding to the deleted pairs of vertex markers. Let $x$ be the number of deleted vertex markers,
and let $c$ be the number of vertices in the vertex cover returned by the function $g$. Then $c \le x/2$. It
follows that

\[ |c^* - c| = c - c^* \le x/2 - x^*/2 = |x^* - x|/2. \]

Choose $\beta = 1/2$, then property (2) of L-reduction is also satisfied.

The L-reduction from Min-VC-3 to CMSR-3 can be obviously generalized:

Lemma 10. Let $\Delta \ge 3$ and $d \ge 3$. If there is a polynomial-time algorithm for decomposing any graph of
maximum degree $\Delta$ into $d - 1$ linear forests, then there is an L-reduction from Min-VC-$\Delta$ to CMSR-$d$ with
constants $\alpha = 2$ and $\beta = 1/2$.

Recall that there exist polynomial-time algorithms for decomposing a graph of maximum degree 3 and
4 into at most 2 and 3 linear forests, respectively [2, 1, 3]. Thus we have an L-reduction from Min-VC-3
to CMSR-3 and an L-reduction from Min-VC-4 to CMSR-4, with the same parameters $\alpha = 2$, $\beta = 1/2$,
and $\alpha\beta = 1$. Chlebík and Chlebíková [12] showed that Min-VC-3 and Min-VC-4 are NP-hard to
approximate within 1.0101215 and 1.0202429, respectively. It follows that CMSR-3 and CMSR-4 are also
NP-hard to approximate within 1.0101215 and 1.0202429, respectively. The lower bound for CMSR-4
extends to CMSR-$d$ for all $d \ge 4$. Note that we could use an L-reduction from Min-VC-3 to CMSR-4
similar to the L-reduction from Max-IS-3 to MSR-4 in Section 3, but that only gives us a weaker lower
bound of 1.0101215 for CMSR-4.

CMSR-2 is APX-hard. Let p = 3 and q > 2. We present an L-reduction (/, g, a, /?) from Ep-Occ- 
Max-Eg-SAT to CMSR-2 as follows. The function /, given the Ep-Occ-Max-Eg-SAT instance (X,C), 
constructs the two genomic maps Gi and G2 as in our NP-hardness reduction in Section [51 As before, 
let k* be the maximum number of clauses in C that can be satisfied by an assignment of X, and let /* be 
the maximum total strip length of any two subsequences of Gi and G2, respectively. Also let x* be the 
minimum number of deleted markers. Then /* + x* is exactly the number of markers in each genomic 
map, that is, 2(5n + m + qm + 2). By Lemma |7l we have I* = 2(3n + m + /c* + 2). Thus x* = 
2(5n + m + qm + 2) — 2(3n + m + k* + 2) = 2{2n + qm — k*). Since a random assignment of each 
variable independently to either true or false with equal probability ^ satisfies each disjunctive clause of q 
literals with probability 1 — ^, we have k* > ^^^m. Recall that np = mq. It follows that 

X* = 2(2n + qm - k*) = 2 (^2^ + q^ m - 2k* < ^2 ^2 ^ + q^ ^^-J - 2^ A:*. 

For p = 3 and g' = 2, we can choose a = 2(2 | + q)-^^ — 2 = 62/9. Then property ([T]l of L-reduction is 
satisfied. 

The function g, given two subsequences of the two genomic maps Gi and G2, transforms the subse- 
quences into canonical form as in the proof of Lemma |6l then returns an assignment of X corresponding 
to the choices of true or false markers. Let / be the total strip length of the subsequences, and let x be the 
number of deleted markers. Let k be the number of clauses in C that are satisfied by this assignment. Then 

\k* -k\ < \l* -l\/2 = \x* -x\/2. 

Choose /3 = 1/2. then property Q of L-reduction is satisfied. 

Berman and Karpinski [8] showed that E3-Occ-Max-E2-SAT is NP-hard to approximate within any
constant less than $\frac{464}{463} = 1.00215\ldots$. Since $\alpha\beta = 31/9$, CMSR-2 is NP-hard to approximate within any
constant less than

\[ 1 + (1/464)/(31/9) = 1 + 9/14384 = 1.000625\ldots. \]
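
The step from the two L-reduction constants to this bound is left implicit. One way to spell it out is sketched below, where $\epsilon$ is our own symbol for the relative error of a hypothetical polynomial-time approximation algorithm for CMSR-2 and does not appear in the argument above.

```latex
% Assumption: a hypothetical algorithm returns x <= (1 + \epsilon) x^* for CMSR-2.
\begin{align*}
k^* - k &\le \beta\,(x - x^*) \le \beta\,\epsilon\,x^* \le \alpha\beta\,\epsilon\,k^*
  && \text{(property (2), then property (1))} \\
\Longrightarrow\quad k &\ge (1 - \alpha\beta\,\epsilon)\,k^* .
\end{align*}
% Unless P = NP, the ratio k^*/k cannot be pushed below 464/463,
% so 1 - \alpha\beta\,\epsilon \le 463/464, that is,
\[
\epsilon \;\ge\; \frac{1/464}{\alpha\beta} \;=\; \frac{1/464}{31/9} \;=\; \frac{9}{14384}.
\]
```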

An asymptotic lower bound for CMSR-d and a lower bound for CMSR-d with unbounded d. Chlebík
and Chlebíková [12] showed that for any $\Delta \ge 228$, Min-VC-$\Delta$ is NP-hard to approximate within
$\frac{7}{6} - O(\log \Delta / \Delta)$. By the second of the two inequalities bounding $f(\Delta)$, it follows that if $\Delta \le 227$, then
$f(\Delta) \le \lceil 3 \lceil 227/2 \rceil / 2 \rceil = 171$. Consequently, if $f(\Delta) \ge 172$, then $\Delta \ge 228$. By Lemma 10, there is an
L-reduction from Min-VC-$\Delta$ to CMSR-$(f(\Delta) + 1)$ with $\alpha = 2$ and $\beta = 1/2$. Therefore, for any $d \ge 173$,
CMSR-d is NP-hard to approximate within $\frac{7}{6} - O(\log d / d)$.
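
The threshold 171 can be checked mechanically. The sketch below evaluates the upper bound $\lceil 3 \lceil \Delta/2 \rceil / 2 \rceil$ in integer arithmetic; the function name `f_upper` is a hypothetical label for that bound, not for $f(\Delta)$ itself.

```python
def f_upper(delta: int) -> int:
    """Upper bound ceil(3 * ceil(delta / 2) / 2) on f(delta) (hypothetical helper)."""
    half = (delta + 1) // 2        # ceil(delta / 2) for a nonnegative integer delta
    return (3 * half + 1) // 2     # ceil(3 * half / 2)

print(f_upper(227))                              # 171
# The bound is nondecreasing in delta, so it never exceeds 171 for delta <= 227.
print(max(f_upper(d) for d in range(1, 228)))    # 171
```

Since the bound is nondecreasing, $f(\Delta) \ge 172$ rules out every $\Delta \le 227$, which is exactly the contrapositive used in the text.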

The maximum degree $\Delta$ of a graph of $n$ vertices is at most $n - 1$. Again by the second inequality bounding $f(\Delta)$,
we have $f(\Delta) \le \lceil 3 \lceil (n - 1)/2 \rceil / 2 \rceil$. Thus $f(\Delta)$ is bounded by a polynomial in $n$. If d is not a constant
but is part of the input, then a straightforward generalization of the L-reduction from Min-VC-3 to CMSR-3
as in Lemma 10 gives an L-reduction from Minimum Vertex Cover to CMSR-$(f(\Delta) + 1)$ with $\alpha = 2$ and
$\beta = 1/2$. Dinur and Safra [14] showed that Minimum Vertex Cover is NP-hard to approximate within any
constant less than $10\sqrt{5} - 21 = 1.3606\ldots$. It follows that if d is not a constant but is part of the input, then
CMSR-d is NP-hard to approximate within any constant less than $10\sqrt{5} - 21 = 1.3606\ldots$. This completes
the proof of Theorem 3.

Inapproximability of $\delta$-gap-MSR-d and $\delta$-gap-CMSR-d. It is easy to check that all instances of MSR-d
and CMSR-d in our constructions for Theorem 1 and Theorem 3 admit optimal solutions in canonical form
with maximum gap 2, except for the following two cases:

1. In the L-reduction from Ep-Occ-Max-Eq-SAT to MSR-2 and CMSR-2, a strip that is a pair of literal
markers has a gap of $q - 1$, which is larger than 2 for $q \ge 4$.

2. In the L-reduction from d-Dimensional-Matching to MSR-$(d+2)$, a strip that is a pair of edge markers
may have an arbitrarily large gap if it corresponds to one of many hyper-edges that share a single
vertex.

To extend our results in Theorem 1 and Theorem 3 to the corresponding results in Theorem 4, the first
case does not matter because we set the parameter $q$ to 2 when deriving the lower bounds for MSR-2 and
CMSR-2 from the lower bound for E3-Occ-Max-E2-SAT.

The second case is more problematic, and we have to use a different L-reduction to obtain a slightly
weaker asymptotic lower bound for $\delta$-gap-MSR-d. Trevisan [25] showed that Max-IS-$\Delta$ is NP-hard to
approximate within $\Delta / 2^{O(\sqrt{\log \Delta})}$. By Lemma 3, there is an L-reduction from Max-IS-$\Delta$ to
$\delta$-gap-MSR-$(f(\Delta) + 2)$ with $\alpha\beta = 1$. By the two inequalities bounding $f(\Delta)$, we have $f(\Delta) + 2 = \Theta(\Delta)$.
Thus $\delta$-gap-MSR-d is NP-hard to approximate within $d / 2^{O(\sqrt{\log d})}$. This completes the proof of Theorem 4.

9 Concluding remarks 

A strip of length $l$ has $l - 1$ adjacencies between consecutive markers. In general, $k$ strips of total length
$l$ have $l - k$ adjacencies. Besides the total strip length, the total number of adjacencies in the strips is also
a natural objective function of MSR-d [10]. It can be checked that our L-reductions for MSR-d and
$\delta$-gap-MSR-d still work even if the objective function is changed from the total strip length to the total number of
adjacencies in the strips. The only effect of this change is that the constant $\alpha$ is halved and correspondingly
the constant $\beta$ is doubled (from 1/2 to 1). Since the product $\alpha\beta$ is unaffected, Theorem 1 and the second part
of Theorem 4 remain valid. For Theorem 2, we can adapt the $2d$-approximation algorithm for maximizing
the total strip length to a $(2d + \epsilon)$-approximation algorithm for maximizing the total number of adjacencies
in strips, for any constant $\epsilon > 0$. The only change in the algorithm is to enumerate all $d$-intervals of strip
lengths at most $O(1/\epsilon)$, instead of 2 and 3. We note that the small difference between the two objective
functions, total length versus total number of adjacencies, has led to differences in the complexities of two
other bioinformatics problems [21, 19]: For RNA secondary structure prediction, the problem Maximum
Stacking Base Pairs (MSBP) maximizes the total length of helices, and the problem Maximum Base Pair
Stackings (MBPS) maximizes the total number of adjacencies in helices. On implicit input of base pairs
determined by pair types, MSBP is polynomially solvable, but MBPS is NP-hard and admits a
polynomial-time approximation scheme [21]; on explicit input of base pairs, MSBP and MBPS are both NP-hard, and
admit constant approximations with factors 5/2 and 8/3, respectively [19].
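
The relation between the two objective functions for strips, stated at the start of this paragraph, can be illustrated with a minimal sketch. The function name and the sample strip lengths are hypothetical and chosen only for illustration.

```python
def total_adjacencies(strip_lengths):
    """Total adjacencies over all strips: a strip of length l contributes l - 1."""
    return sum(length - 1 for length in strip_lengths)

# Hypothetical example: k = 3 strips of total length l = 10.
strips = [2, 3, 5]
print(total_adjacencies(strips))        # 7
print(sum(strips) - len(strips))        # 7, that is, l - k
```

Because every strip has length at least 2, the number of adjacencies is at least half the total strip length, so the two objectives differ by at most a factor of 2 on any fixed solution.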

In our Theorem 1 and Theorem 3, we have chosen to display explicit lower bounds for MSR-2 and
CMSR-2, despite the fact that they are rather small and unimpressive. As commented by M. Karpinski
after the author's ISAAC presentation, it may be possible to improve the lower bound for MSR-2 by an
L-reduction from another problem. For example, Berman and Karpinski [8] proved that E3-Occ-Max-E2-SAT
is NP-hard to approximate within any constant less than $\frac{464}{463}$ by an L-reduction from Ed-Occ-Ek-LIN-2,
and proved that Ed-Occ-Ek-LIN-2 is NP-hard to approximate within some other constant by an L-reduction
from yet another problem, and so on. By constructing an L-reduction directly from Ed-Occ-Ek-LIN-2 to
MSR-2, say, we might obtain a better lower bound. We did not pursue this direction in this paper. Since
satisfiability problems are well known, we chose an L-reduction from E3-Occ-Max-E2-SAT to MSR-2 for
the sake of a gentle presentation, and we made no effort to optimize the constants.

We proved Theorem 4 by extending our proofs of Theorem 1 and Theorem 3 with minimal modifications.
We note that the $\delta$-gap constraint actually makes it easier to prove the APX-hardness of $\delta$-gap-MSR-d and
$\delta$-gap-CMSR-d than to prove the APX-hardness of MSR-d and CMSR-d. For example, our
E3-Occ-Max-E2-SAT constructions for MSR-2 and CMSR-2 can be much simplified to obtain better approximation lower
bounds for $\delta$-gap-MSR-d and $\delta$-gap-CMSR-d. We omit the details and refer to [10] for more results on
these restricted variants. On the other hand, the correctness of our reductions does require gaps of at least 2
markers. Thus our proofs do not imply the APX-hardness of 1-gap-MSR-d or 1-gap-CMSR-d. Consistent
with our results, Bulteau, Fertin, and Rusu [10] proved that $\delta$-gap-MSR-2 is APX-hard for all $\delta \ge 2$ and is
NP-hard for $\delta = 1$.

A curious concept called paired approximation was recently introduced by Eppstein [15]. For certain
problems on the same input, say Clique and Independent Set on the same graph, sometimes we would be 
happy to find a good approximation to either one, if not both. Inapproximability results for pairs of problems 
are often incompatible: the hard instances for one problem are disjoint from the hard instances for the other 
problem. As a result, an approximation algorithm may find a solution to one or the other of two problems 
on the same input that is better than the known inapproximability bounds for either individual problem. Note
that our inapproximability results for MSR-2 and CMSR-2 are compatible because they are obtained from 
the same reduction from E3-Occ-Max-E2-SAT. Thus even as a paired approximation problem, (MSR-2, 
CMSR-2) is still APX-hard. This is the first inapproximability result for a paired approximation problem in 
bioinformatics. 

Postscript. The APX-hardness results for MSR-2 and MSR-3 in Theorem 1 were obtained in December
2008. The author was later informed by Binhai Zhu in January 2009 that Lusheng Wang and he had
independently and almost simultaneously proved a weaker result that MSR-2 is NP-hard [26].